Matching fuzzy strings

1.3k views Asked by At

I have two tables that I need to merge together in PostgreSQL, on the common variable "company name." Unfortunately many of the company names don't match exactly (i.e. MICROSOFT in one table, MICROSFT in the other). I've tried removing common words from both columns such as "corporation" or "inc" or "ltd" in order to try to standardize names across both tables, but I'm having trouble thinking of additional strategies. Any ideas?

Thanks.

Also, if necessary I can do this in R.

1

There are 1 answers

1
Anders Marzi Tornblad On

Have you considered the fuzzystrmatch module? You can use soundex, difference, levenshtein, metaphone and dmetaphone, or a combination.

fuzzystrmatch documentation

SELECT something
FROM somewhere
WHERE levenshtein(item1, item2) < Carefully_Selected_Threshold

For example the levenshtein distance from MICROSOFT to MICROSFT is one (1).

levenshtein(dmetaphone('MICROSOFT'), dmetaphone('MICROSFT')

The above returns zero (0). Combining levenshtein and dmetaphone could help you match lots of misspellings.