I am trying to use the RecordLinkage package to link two datasets, where one dataset tends to give multiple last/middle names and the other gives just a single last name. The string comparison function currently being used is Jaro-Winkler; however, the score it returns depends on how the strings happen to line up position by position, rather than on whether the shorter string is contained anywhere in the longer string. This is leading to many poor-quality links being created. A reproducible example of the wrong weightings is as follows:
library(RecordLinkage)

# Dataset 1: several names in a single field
data1 <- as.data.frame(list("lname" = c("lolli gaggen nazheem", "lolli gaggen nazheem", "lolli gaggen nazheem"),
                            "bday" = c("1908-08-08", "1979-12-12", "1560-06-06")))
# Dataset 2: a single last name per record
data2 <- as.data.frame(list("lname" = c("lolli", "gaggen", "nazheem"),
                            "bday" = c("1908-08-08", "1979-12-12", "1560-06-06")))

# Block on birthday, then compare the remaining fields with a string comparator
blocking_variable <- c("bday")
pass <- compare.linkage(data1, data2, blockfld = blocking_variable, strcmp = TRUE)
pass_weights <- epiWeights(pass)
getPairs(pass_weights, single.rows = TRUE)
  id1              lname.1     bday.1 id2 lname.2     bday.2    Weight
1   1 lolli gaggen nazheem 1908-08-08   1   lolli 1908-08-08 0.9162463
2   2 lolli gaggen nazheem 1979-12-12   2  gaggen 1979-12-12 0.8697165
3   3 lolli gaggen nazheem 1560-06-06   3 nazheem 1560-06-06 0.6995502
I want ids 2 and 3 to receive roughly the same weighting as id 1; however, they are currently much lower because their last names are not in the same position in both datasets (although the content agrees). Is there a way I can modify the string comparison function being used here, or the structure of the data, so that I can take account of the different orderings?
Additional Notes:
- Both datasets have millions of rows, so memory efficiency is definitely important here!
- Sometimes the other dataset may have more than a single last name, so we'd be comparing 3 words against 2 words. It would probably be best to start by tackling the easy case first, though.
- More often than not there will be spelling differences in the names between the two datasets.
- Currently we are using IBM's QualityStage to do this linking, and it uses the "MULT_UNCERT" comparison function (https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ds.design.help.doc/topics/r_qresfgde_MULT_UNCERT_comparison.html). I want to replicate this in R.
Have you thought about the following approach?
Record linkage on names is, as I'm sure you know, difficult. Ideally you want to block on other available information (gender, unique identifiers, date of birth, location information, etc.) and then do string comparisons on the names.
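To make the "string comparisons on the names" step concrete for the word-order problem in the question, here is a rough sketch of one option: split the longer name into words and keep the best score of the short name against any single word. Note the assumptions: `token_best_sim` is a hypothetical helper name I made up for illustration, and to keep it dependency-free it uses a normalized Levenshtein similarity from base R's `adist` rather than Jaro-Winkler (a Jaro-Winkler function could be swapped in for the `adist` line). It only handles the easy case of a single word on the short side.

```r
# Hypothetical helper (not part of RecordLinkage): best similarity between a
# short name and any single word of a longer, multi-word name.
# Similarity = 1 - Levenshtein distance / length of the longer of the two strings.
token_best_sim <- function(long_name, short_name) {
  tokens <- strsplit(long_name, "\\s+")[[1]]          # split on whitespace
  d <- utils::adist(tokens, short_name)[, 1]          # edit distance per token
  sims <- 1 - d / pmax(nchar(tokens), nchar(short_name))
  max(sims)                                           # word order no longer matters
}

long_names  <- rep("lolli gaggen nazheem", 3)
short_names <- c("lolli", "gaggen", "nazheem")

# One score per candidate pair (pairs that already share a block)
scores <- mapply(token_best_sim, long_names, short_names, USE.NAMES = FALSE)
print(scores)  # 1 1 1: each short name matches one word exactly

# A spelling difference lowers the score but the pair still scores well
token_best_sim("lolli gaggen nazheem", "nazeem")
```

With millions of rows you would only want to call something like this on pairs that already agree on the blocking fields, never on the full cross product.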
You mention large datasets with millions of records. Look no further than the data.table package by the great Matt Dowle (https://stackoverflow.com/users/403310/matt-dowle). The RecordLinkage package is slow in comparison. You could easily improve the code below by adding string hashing techniques such as soundex, double metaphone, NYSIIS, etc.