How to check string similarity from two tables with multiple words input in BigQuery

180 views Asked by At

I have two tables containing a list of company names. The first one would be the index table therefore the value inside would be clean and the format is consistent. The second table contains user-input company information and therefore there might be typos and format inconsistencies.

The first one (the index table) looks similar to this:

company_name                         |      industry
Apple Inc                                   Technology
Amazon Inc                                  Retail
Kraft Heinz                                 Food Processing
New York Life Insurance Company             Insurance
Tesla Inc                                   Tesla
Walmart Inc                                 Retail

The second table (user input table) looks similar to this

company_name
Apple
Apple Inc.
Amazon, Inc
Kraft
New York Life
Tsla
Walmart

Notice that the second table does not have the industry column since the main goal would be to add the industry column to the second table, but since there is no key we can't simply join the two tables.

I think the first step would be to compare the similarity between the company_name and if it is similar enough we can just assume it is the same.

I have done some research and I think we would need to use Levenshtein Distance. Honestly, I am not very familiar with that method, but from my limited understanding, I wonder if it is applicable for inputs containing multiple words (e.g New York Life Insurance Company) or it is just effective for one-word input (e.g Apple).

Any suggestions or guidance would be greatly appreciated.

0

There are 0 answers