I have two tables containing a list of company names. The first one would be the index table therefore the value inside would be clean and the format is consistent. The second table contains user-input company information and therefore there might be typos and format inconsistencies.
The first one (the index table) looks similar to this:
company_name | industry
Apple Inc Technology
Amazon Inc Retail
Kraft Heinz Food Processing
New York Life Insurance Company Insurance
Tesla Inc Tesla
Walmart Inc Retail
The second table (user input table) looks similar to this
company_name
Apple
Apple Inc.
Amazon, Inc
Kraft
New York Life
Tsla
Walmart
Notice that the second table does not have the industry
column since the main goal would be to add the industry
column to the second table, but since there is no key we can't simply join the two tables.
I think the first step would be to compare the similarity between the company_name
and if it is similar enough we can just assume it is the same.
I have done some research and I think we would need to use Levenshtein Distance. Honestly, I am not very familiar with that method, but from my limited understanding, I wonder if it is applicable for inputs containing multiple words (e.g New York Life Insurance Company) or it is just effective for one-word input (e.g Apple).
Any suggestions or guidance would be greatly appreciated.