I have a data frame column containing thousands of names. Many of the names differ from one another only slightly. I would like to find a way to match these names. For example:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')
I've looked at amatch in the stringdist package, as well as agrep, but these all require a master list of names to match another list against. In my case, I don't have such a master list, so I'd like to create one from the data by identifying names with highly similar patterns, so that I can look at them and decide whether they're the same person (which in many cases they are). I'd like the output in a new column that flags likely matches, and maybe some sort of similarity score based on Levenshtein distance or something similar. Maybe something like this:
            names match SimilarityScore
1       jon smith     a               9
2      jon, smith     a               8
3       Jon Smith     a               9
4 jon smith et al     a               5
5       bob seger     b               9
6      bob, seger     b               8
7     bobby seger     b               7
8   bob seger jr.     b               5
Is something like this possible?
Drawing on the post found here, I found that hierarchical text clustering does what I'm looking for.
The output looks really good if you pick the right number of clusters (three in this case):
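A minimal sketch of this kind of clustering, assuming base R's adist, hclust and cutree, applied to the names from the question. The variable names (name_vec, result) are illustrative, and k = 3 simply mirrors the text; the right value depends on the data.

```r
# Names from the question; `name_vec` is an illustrative variable name.
name_vec <- c("jon smith", "jon, smith", "Jon Smith", "jon smith et al",
              "bob seger", "bob, seger", "bobby seger", "bob seger jr.")

# Pairwise generalized Levenshtein (edit) distances with default costs.
d <- adist(name_vec)
rownames(d) <- colnames(d) <- name_vec

# Hierarchical clustering on the distances (complete linkage by default).
hc <- hclust(as.dist(d))

# Cut the tree into k groups; k is chosen by inspection (assumption: k = 3).
k <- 3
result <- data.frame(names = name_vec,
                     cluster = cutree(hc, k = k),
                     row.names = NULL)
print(result)

# plot(hc) draws the dendrogram, which can help when choosing k.
```

The cluster column plays the role of the "match" column in the desired output; names sharing a cluster number are candidates to review by hand.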
However, names are often more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering.

I was able to improve on this by raising the substitution cost to 2, leaving the insertion and deletion costs at 1, and ignoring case. This helped minimize the mistaken grouping of completely different four-character number strings, which I didn't want grouped together.
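As a sketch of the cost change described above, adist accepts a costs argument and an ignore.case flag. The sample strings below are made up for illustration; they are not the original data.

```r
# Illustrative strings: near-duplicate names plus two unrelated number strings.
x <- c("jon smith", "Jon Smith", "jon, smith", "1234", "5678")

# Default costs: insertion = deletion = substitution = 1, case-sensitive.
d_default <- adist(x)

# Substitution costs 2, insertion and deletion cost 1, case ignored.
d_weighted <- adist(x,
                    costs = list(ins = 1, del = 1, sub = 2),
                    ignore.case = TRUE)

d_default[4, 5]   # "1234" vs "5678" with default costs
d_weighted[4, 5]  # same pair with the heavier substitution cost
d_weighted[1, 2]  # "jon smith" vs "Jon Smith" with case ignored
```

With the heavier substitution cost, strings of equal length that share no characters end up farther apart, while names that differ only by case or an extra comma stay close, so the unrelated number strings are less likely to be merged into one cluster.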
I further fine-tuned the clustering by removing common terms such as "ranch" and "et al" with the base R gsub function, and by increasing the number of clusters by one.

Although there are methods that let the data determine the best number of clusters instead of picking it manually, I found trial and error easiest; there is information here about that approach.
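The term-removal step mentioned above could look something like the following sketch. The pattern, the sample strings, and the whitespace cleanup are assumptions for illustration, not the original code.

```r
# Illustrative input; "ranch" and "et al" are the common terms named above.
x <- c("jon smith et al", "Smith Ranch", "bob seger")

# Remove the common terms as whole words, ignoring case.
cleaned <- gsub("\\b(ranch|et al)\\b", "", x, ignore.case = TRUE, perl = TRUE)

# Collapse repeated spaces and trim leading/trailing whitespace.
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))

print(cleaned)
```

The cleaned vector can then be passed to adist in place of the raw names, keeping the original names alongside for display.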