Join our newsletter

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
How To
Text Clustering
Some time ago, I was left with a puzzler that nobody wanted to touch. Even though I did not have much experience with the issue, I went for it to learn something new. And I enjoyed it at the end! Clus

What was it all about?

What was it all about?

One of our potential clients gets information from various places about how their products are selling. They would like to merge this data together and add their own sales data to it. But it’s not that easy – every source calls the same product a little differently.

Let’s take a closer look at the product “LEGO Friends 3061: City Park Cafe”. This could look like this in the data:

LEGO Friends - formatting, packaging, distribution, region, promotions …

All labels, notes, distribution channels, etc. can be misspelled, and suppliers can combine the info in many ways. In addition, there are hundreds of LEGO products as well as their Chinese imitations available on the market.

What was my goal?

I wanted to simplify the data and create a “pairing list” as a connecting block of the information obtained from different vendors. In other words, I wanted to be able to identify sales reported as “Lego Friends 3061: City Park Cafe” and “lego friends / 3061 / City-Park Cafe” as the same product. There are hundreds of LEGO products in the client’s data, and this is only an example of how a linking block for “LEGO Friends: City Park Cafe” can look like. The input data is matched against the blue columns, the green columns are the “standard” names to do the aggregation of sales numbers:

How did I want to do it?

I considered seven different approaches:

  1. Go through it manually and make a list as a trained monkey would do.
  2. Try to let train entity detection and get the green lines from them.
  3. Try it with
  4. Apply a text token (n-gram).
  5. Apply phonetic algorithms.
  6. Use the Levenshtein distance. And finally,
  7. Implement the principle of DNA sequencing – Kolmogorov complexity.

I liked the first option the best, but it would take a long time, and my error rate would change (i.e. the quality of the list at the time of production). Also, it would not be possible to correct it from the beginning if I discovered in the middle of my attempt that some products were connected wrongly. And finally, this would be difficult to extend with other data sources.

I did not want to involve the guys from Geneea (option 2) because I did not have the cash to pay them. Trifacta (option 3) seemed to me more like a (slow) clicking generator of very erroneous rules.

I decided to try the remaining points (4–7) using OpenRefine, the successor to GoogleRefine. For production deployment, you can choose to manage the instance from – I wrote about it here.

Text Clustering

If I get a fingerprint from each product name that is unique to the product, I will always have a chance to have different texts of the same meaning for a single fingerprint.


The n-gram algorithm is the so-called tokenization algorithm that first simplifies the text, turns everything into lowercase, removes punctuation, spaces, and characters such as dashes, commas, semicolons, etc.

It breaks the text to “n-grams”. If I do n-gram = 2, the text will be broken down by two characters: Praha = pr, ra, ah, ha.

These n-character combinations are sorted, and duplicates are thrown out. The resulting sequence of n-grams is joined together; we get “ahhaprra” from Praha, Kokotín gives us “inkookotti” (if I do 1-gram, Praha will be “ahpr” and Kokotín “iknot”).

This is great for misspellings in real-life typing. Not so much for nonsense, abbreviations, etc. because “abc” and “bca” have the same 2-gram (ab, bc), but it’s different. However, Wikipedia says it has been empirically found that if two real texts have the same “vector” (n-gram array: {“ab”, “bc”}), then it is very likely the same text. I definitely wanted to try this!

Phonetic algorithm

OpenRefine contains an implementation of the Metaphone3 algorithm. It makes a fingerprint based on how it sounds in English and has a ton of rules:

For my use case, the results were pretty accurate!

Levenshtein’s distance

Sometimes it is called “edit distance”, and it’s super easy: for every two strings of text it gives you a number that is equal to the number of change operations you have to turn one text into the other.


_CoffeeBeer = 4CoarseCourse = 1Lego Friends 3061: City Park CafeLego-Friends 3061: City Park Cafe = 1_

The larger distance you tolerate, the greater the chance of considering equal two texts that are actually different. A small distance “guarantees” the correction of typos, misspellings, etc.

Kolmogorov complexity

It is called PPM in OpenRefine. When I found out it was used to search for similarities in the DNA-derived strings I could not go to sleep until I tried it!

This algorithm is based on the idea that text compression “somehow” (mathematically) works with the information value of the compressed text. If I join A + B one after another and compress it, the resulting size (A + B) will de facto be the same as the compression itself (A).


I took the previous paragraph as both document A and document B, saved it once in document “docA.txt” and then saved it twice (A + B) into document “docAB.txt”. I compressed both documents with gzip, docA was compressed to 239b and docAB to 249b. The difference was 10 characters even though docAB.txt is 271 characters longer. The difference in compression of two identical documents compared to compression of one (identical) document thus makes 3.6%. However, if A were the first paragraph of the description of this algorithm and the second paragraph were B, the difference in compression of A and compression of A + B would be 53.1%.

Therefore, the PPM algorithm in OpenRefine takes two different texts and applies compression in various combinations, from which it always gets the size. The different coefficient is calculated according to the following formula:

In my example of identical A and B, the formula would look like this:

Finally, if A is the first paragraph of this description and B is the second paragraph, the pattern will be as follows:

It is good to note that this approach has better results in detecting similarities if the text is longer. You can take notes from the Scout Chronicle and sort them by close similarity to find that they have similar stories, although this is, of course, always a completely different text!

Let’s get to work!

I will not try to describe how to install and work with OpenRefine – I’ll leave it for the official screencasts.

As soon as my data was uploaded, I started by duplicating the product column. The goal is to change the copy of the values so that I will get a pair “original text” <> “clustered text” at the end:

Then I chose “Face Faceting” on the new column and pressed the “Cluster” button in the left box:

A dialog box appears where you can select the method (tokenization / similarity) and the function (n-gram / metaphone / …). It is possible to change the parameters. OpenRefine applies the calculation for you and displays suggestions of which rows from our column belong together. It is good to understand algorithms and play with them. You do not have to be a mathematician who can implement it in C ++, but it is good to know that for Levenshtein’s distance, for instance, you should not use Radius=20 because it will replace “Pork Feast” for “Pheasant Race”.

If the estimated Cluster looks okay to you, check “Merge?” and name the values for that cluster. Whenever you go to “Merge Selected & Re-Cluster”, what you set up is applied and everything is calculated again. I first made clusters using sensitive tokenization algorithms (n-gram = 1) and then went to PPM / Metaphone / Levenshtein (radius>1)… When there is nothing left to cluster (or it just does not progress further), close the window. In my simple LEGO example, the result is just like a fairy tale:

Imagine that there are thousands of rows in the left column describing dozens of products – then in the right column, we want to have dozens of clusters. The use is obvious: the sales data from many different sources will be joined here (the key is the source column + the original product column), and the values ​​from the OpenRefine column link everything that belongs to one product, regardless of how messy the descriptions in the data sources are.

What did I learn?

The quality of my data is really bad. It is good to try and clean it a little bit for better clustering. I used UPPER () in SQL to convert all the characters into uppercase, removed clear nonsense (region information, type of packaging, etc.), and deduplicated the rows – I managed to get the initial tens of thousands of items down to about 1,500. I needed to delete certain parts of the texts, e.g. “ABC”, but only at the end of lines (i.e., I kept “ABC Special”, but deleted the last “ABC” part in “ABC Special ABC”). This is a great fit for regular expressions. Do not be frightened by it! In the DB, there is a REGEXP_REPLACE that would remove “DHL” and “PPL” from the end of the line as follows:

The “$” character says it matches only if it’s at the end of the line (so I usually TRIM the column to not have a space at the end and not need the regexp pattern to be ‘DHL $PPL $’).

The pipe (“”) separates the array of values ​​I am looking for (so I do not have to call REPLACE as many times as I want to change.) Matches will be replaced with nothing (‘’). I have quite a number of such rules. They simplify my work in OpenRefine – less data, better accuracy. If you put “^” in front of a string instead of “$”, it means that it only searches at the beginning of the line. Most of the rules that you may need are easy to find with Google. Pay attention to regular expressions – it’s a very powerful thing!

I saved the best for last: In OpenRefine, each action generates a rule. To view the rules, use the switch at the upper left corner below the logo in the “Undo / Redo” menu:

The whole definition of clustering is interactive – you do not want the machine to decide for you in this case – but once you define it, you can easily repeat it. To do this, just export the commands / rules (definitions in JSON):

If you copy them (my cfg from this blog post is here), you can take the original data as it looks when it’s loaded into OpenRefine and apply these rules directly to it – it’s not necessary to click it all again. This leads to full automation (the final question of the FAQ), which is a bit of a hack. Does anyone have experience with it?

I hope that it will encourage you to experiment with OpenRefine. It is a very powerful tool and it pays to learn to use it – at least its basics. You can find videos showing how to call an external API to enrich your cell values, etc. on the Internet.

Check out this cheat sheet.


Before joining OpenRefine, I spent a lot of time carefully preparing my data. I have not stressed it enough, but the cleanup is crucial. Here’s an example of how I split the text using a rule about text logic.

I have the following ten sample lines:

I’d like to take everything to the left of a product code. But I don’t know where exactly the product code is. So I defined it as a number containing more than 3 digits.

Using this:

Lines 3-6 show a function with two parameters: column and regexp pattern – it returns a matched pattern’s position.

I’ll get:

If I wanted to find a date, I would only define another regular expression (source for reading).

The function in lines 7-13 takes the column (line 8) as one parameter and the position within the text as the second parameter position (l.9-12). Since the position starts at the point where the matched pattern is, I deduct one from the position (I.12) so that I do not have the first character of my number in the LEFT () product.

If I take everything to the left from this position using LEFT (), I will get this:

In the first row, one can calculate that “(“ is at position 14, but REGEXP_INSTR returns 15 therefore the “-1” on line 12 in SQL.

In the fifth row, I can see that I caught only a piece of the number with a dot in the middle. Being no regexp master, I finally solved it in SQL. But I’m going to write a Regex matching number and decimals rule. And I placed “replace” inside of the function to delete my dot:

Finally, I dropped all the characters I did not want to have, used uppercase, and removed any leading / trailing spaces using TRIM ():

Lines 11-20 are what’s still visible (because I’m not doing it anymore, so I dropped a list of matched positions); on lines 4-8, I replace all non-alphabetical characters, spaces (, and the word “edition” with a space, getting rid of all the mess; on line 3, I use uppercase letters for the results and cut the spaces between them with TRIM

This code returns exactly what I want:

You can see that I can play with it and iteratively tune the regexp patterns to follow the logic of the text. Only when I have nice product names in a separate column, I will send it to OpenRefine.

If the source data were not a “log” of sales, but perhaps Amazon reviews where people complain that building LEGO 3061 is no fun, I would go with my own entity detection model from BTW, this is what I’m going to do for another client of ours – I can’t wait to see how it turns out :).

Stay in touch

Download the files
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.