R string-based matching of business names


TL;DR: I'd like to match two columns of unequal length whose values are business names. I've tried stringdist's amatch with Jaro-Winkler distance, which gets close but not nearly close enough. I'm wondering whether stringi would be useful here; I don't quite understand how to use it, so please excuse my being new to this. I wouldn't ask otherwise, but I don't think I'll be able to figure it out myself in time.

For context, there are 2079 business names in one column and 1878 in a second column. Many of them include the business structure as a suffix (LLC, Inc., INC., Co., etc.), so I trimmed those out in Excel before going into R. The names were entered manually in both columns, so there are human data-entry variations.
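The suffix trimming described above could also be done in R, so that both columns get exactly the same treatment before matching. This is a minimal base-R sketch: the helper name `normalize_name` and the suffix list (LLC, INC, CO, CORP, LTD) are assumptions for illustration and would need adjusting to the real data.

```r
# Hypothetical helper: put business names into a comparable form.
normalize_name <- function(x) {
  x <- toupper(x)                        # ignore case differences
  x <- gsub("[.,]", "", x)               # drop periods and commas
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  x <- trimws(x)
  # Strip one trailing business-structure suffix (assumed list).
  gsub(" (LLC|INC|CO|CORP|LTD)$", "", x)
}

cleaned <- normalize_name(c("Acme Widgets, LLC", "acme widgets Inc.", "A & A Co."))
print(cleaned)
```

Applying the same function to both columns means "INC." and "Inc" can no longer produce spurious differences.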

I used this call:

amatch(match$sales, match$box, maxDist = 0.25, method = "jw", weight = c(d = 1, i = 0.9, s = 0.9, t = 0.9), p = 0.2, matchNA = FALSE, bt = 0.25)

I got some results with this, but many matches were duplicated because several companies share the first word or the first few words/letters, e.g. "A & A" vs "A & B". I understand this follows from how the Jaro-Winkler formula works, but I don't know how to adjust it enough.
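To see how much the shared prefix contributes for a pair like "A & A" vs "A & B", the distance can be computed with and without Winkler's prefix factor `p`. This is only an illustrative sketch and assumes the stringdist package is installed; the Damerau-Levenshtein line is included purely as a point of comparison, not as a recommended replacement.

```r
library(stringdist)

a <- "A & A"
b <- "A & B"

# Plain Jaro distance (p = 0): no bonus for a shared prefix.
d_jaro <- stringdist(a, b, method = "jw", p = 0)

# Jaro-Winkler with the prefix factor used in the call above (p = 0.2).
d_jw <- stringdist(a, b, method = "jw", p = 0.2)

# Damerau-Levenshtein counts edits instead of rewarding the prefix.
d_dl <- stringdist(a, b, method = "dl")

print(c(jaro = d_jaro, jw = d_jw, dl = d_dl))
```

Both Jaro variants fall under a `maxDist` of 0.25 for this pair, so short names that differ only in their last character will tend to be reported as matches unless the threshold is tightened or the names are compared with a different method.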

I need to match the values in Column b to Column a. There may be duplicates in Column a. I don't have any specific rules for similarity; I want the closest possible match for each value, with as few false duplicates as possible.
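One way to limit false duplicates is to work from the full distance matrix and assign pairs greedily, closest first, so that each value in Column a is used at most once. The sketch below is an assumption-laden illustration, not a definitive method: `greedy_match` is a hypothetical helper, the one-to-one constraint may be too strict if Column a contains genuine duplicates, and base R's `adist` (generalized Levenshtein) stands in for Jaro-Winkler so that the example has no package dependencies. A matrix from `stringdist::stringdistmatrix` could be passed in instead.

```r
# Hypothetical helper: greedy one-to-one assignment on a distance matrix.
# d[i, j] is the distance between b[i] and a[j]; smaller means closer.
greedy_match <- function(d, max_dist = Inf) {
  out <- rep(NA_integer_, nrow(d))
  while (any(is.finite(d))) {
    best <- min(d[is.finite(d)])
    if (best > max_dist) break           # nothing close enough remains
    k <- which(d == best, arr.ind = TRUE)
    i <- k[1, "row"]
    j <- k[1, "col"]
    out[i] <- j
    d[i, ] <- Inf                        # b[i] is now matched
    d[, j] <- Inf                        # a[j] cannot be reused
  }
  out
}

b <- c("ACME WIDGET", "A & B", "ZEBRA FOODS")
a <- c("A & A", "ACME WIDGETS", "A & B")

idx <- greedy_match(adist(b, a), max_dist = 2)
print(data.frame(b = b, matched_a = a[idx]))
```

Here "A & B" takes its exact counterpart, "ACME WIDGET" takes "ACME WIDGETS", and "ZEBRA FOODS" stays unmatched instead of being forced onto "A & A".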

For starters, would there be an easier way to accomplish this within stringi?

Please advise, as I am unaware how to best tackle this problem moving forward. If further details are required, I'm happy to oblige. Thank you in advance.
