Grouping similar strings that have misspellings, spacing differences, etc

50 views Asked by At

I have a data set of about 1 million employer names. These names are from a free-form text field so they include misspellings and variations in the way they are inputted (e.g. "Amazon" .. "Amzaon" .. "Amazon.com" .. "Amazon Web Services" .. "AWS").

I want to either A) group these 1 million so I have a somewhat accurate sense of how many unique employers are in the data set or B) be able to find all variations of any given employer.

So far, I've been using the data in Tableau, then filtering on "employer name" and searching all variations of the name I can think of. But it's tedious and I'm pretty sure I'm leaving many out.

I've also used the fuzzy add-in for excel but it hasn't worked that well on misspellings, special characters...

2

There are 2 answers

0
Alex Blakemore On

Try experimenting with Tableau Prep Builder - the companion tool that comes with your Tableau Creator license. It has a group feature that is designed for just these problems.

In Prep Builder, you’ll just need to connect to your data, add a cleaning step, and then add a group to your cleaning step.

0
Tom On

Tableau just isn't suited for doing this kind of analysis straight out of the box, and I would highly recommend doing some pre-processing on your data before putting trying to build a workbook around it.

Like another commenter said, you could look into using Tableau Prep Builder for a one-time transformation on your data set, but if you wanted to automate this process it costs extra to add functionality to whatever Tableau Server installation you have.

If you're familiar with Python or R (and the integration between Tableau Server and those services is supported by your organization), you could look into building a script to run the transformation real-time, but it probably won't be too efficient.