approximate string matching within single list - r

2k views Asked by At

I have a list in a data frame of thousands of names in a long list. Many of the names have small differences in them which make them slightly different. I would like to find a way to match these names. For example:

names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')

I've looked at amatch in the stringdist function, as well as agrep, but these all require a master list of names that are used to match another list of names against. In my case, I don't have such a master list so I'd like to create one from the data by identifying names with highly similar patterns so I can look at them and decide whether they're the same person (which in many cases they are). I'd like an output in a new column that helps me to know these are a likely match, and maybe some sort of similarity score based on Levenshtein distance or something. Maybe something like this:

            names   match      SimilarityScore
1       jon smith     a               9
2      jon, smith     a               8
3       Jon Smith     a               9
4 jon smith et al     a               5
5       bob seger     b               9
6      bob, seger     b               8
7     bobby seger     b               7
8   bob seger jr.     b               5

Is something like this possible?

2

There are 2 answers

0
Luke Macaulay On BEST ANSWER

Drawing upon the post found here I have found that hierarchical text clustering will do what I'm looking for.

  names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied')

# Levenshtein Distance
e  <- adist(names)
rownames(e) <- names
hc <- hclust(as.dist(e))
plot(hc)
rect.hclust(hc,k=3) #the k value provides the number of clusters
df <- data.frame(names,cutree(hc,k=3))

The output looks really good if you pick the right number of clusters (three in this case):

                       names             cutree.hc..k...3.
jon smith             jon smith                 1
jon, smith           jon, smith                 1
Jon Smith             Jon Smith                 1
jon smith et al jon smith et al                 1
bob seger             bob seger                 2
bob, seger           bob, seger                 2
bobby seger         bobby seger                 2
bob seger jr.     bob seger jr.                 2
jake                       jake                 3
jakey                     jakey                 3
jack                       jack                 3
jakeyfied             jakeyfied                 3

However, names are oftentimes more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering:

names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied','1234 ranch','5678 ranch','9983','7777')

d  <- adist(names)
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)

Cluster2

I was able to improve upon this by increasing the cost of the substitution value to 2 and leaving the insertion and deletion costs at 1, and ignoring case. This helped to to minimize the mistaken grouping of totally different four character number strings, which I didn't want grouped:

d  <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2)) #i=insertion, d=deletion s=substitution
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6

enter image description here

I further fine tuned the clustering by removing common terms such as "ranch" and "et al" using the gsub tool in the grep package and increasing the number of clusters by one:

names<-gsub("ranch","",names)
names<-gsub("et al","",names)
d  <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2))
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=7)

enter image description here

Although there are methods to let the data sort out the best number of clusters instead of manually trying to pick the number, I found that it was easiest to use trial and error, although there is information here about that approach.

0
crogg01 On

The suggestion by Roman in the comments on the natural language processing is probably the best place to start. But for a back-of-the-envelope type of approach you can look at the distance in terms of ascii code:

mynames = c("abcd efghijkl mn","zbcd efghijkl mn","bbcd efghijkl mn","erqe")
asc <- function(x) { strtoi(charToRaw(x),16L) }
namesToChar= sapply(mynames, asc)
maxLength= max(unlist(lapply(namesToChar,length)))
namesToChar =lapply(namesToChar, function(x) { c(x, rep(-1, times = maxLength-length(x) )) } )
namesToChar = do.call("rbind",namesToChar)
dist(namesToChar,method="euclidean")
dist(namesToChar,method="canberra")

Though it seems to give OK enough numbers for the sample,

> dist(namesToChar,method="manhattan")
                 abcd efghijkl mn zbcd efghijkl mn bbcd efghijkl mn
zbcd efghijkl mn               25                                  
bbcd efghijkl mn                1               24                 
erqe                          257              274              256

this approach suffers from the fact that there does not seem to be an adequate distance method for the dist function for what you want to do. An element-wise binary comparison followed by a more standard distance perhaps ('manhattan' seems closest to your needs)? You could always implement that yourself of course. Also the -1 fill out is a hack here, you would need to replace that with the average ascii code of your sample if you decide to go this route.

For a similarity score versus the overall population you can take inverse of the average distance against each other word.