Compare and link strings with different word orders / word counts

Question

342 views Asked by Maharero At 18 November 2018 at 19:07

I am trying to use the RecordLinkage package to link two datasets, where one dataset tends to give multiple last / middle names and the other gives just a single last name. The string comparison function currently in use is Jaro-Winkler, but the score it returns depends on how the strings happen to line up by position rather than on whether the content of the shorter string appears anywhere in the longer string. This leads to many poor-quality links. A reproducible example of the problematic weights is as follows:

library(RecordLinkage)
data1 <- as.data.frame(list("lname" = c("lolli gaggen nazeem", "lolli gaggen nazeem", "lolli gaggen nazeem"),
                           "bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )

data2 <- as.data.frame(list("lname" = c("lolli", "gaggen", "nazeem"),
                           "bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )

blocking_variable <- c("bday")
pass <- compare.linkage(data1, data2, blockfld = blocking_variable, strcmp = TRUE)
pass_weights <- epiWeights(pass)
getPairs(pass_weights, single.rows = TRUE)

  id1              lname.1     bday.1 id2 lname.2     bday.2    Weight
1   1 lolli gaggen nazheem 1908-08-08   1   lolli 1908-08-08 0.9162463
2   2 lolli gaggen nazheem 1979-12-12   2  gaggen 1979-12-12 0.8697165
3   3 lolli gaggen nazheem 1560-06-06   3 nazheem 1560-06-06 0.6995502

I want IDs 2 and 3 to receive roughly the same weight as ID 1. Currently their weights are much lower because the matching last name is not in the same position in both datasets, even though the content agrees. Is there a way to modify the string comparison function used here, or the structure of the data, so that different orderings are taken into account?

Additional Notes:

Both datasets have millions of rows so memory efficiency is definitely important here!
Sometimes the other dataset may have more than a single last name, so we'd be comparing 3 words against 2 words. It would probably be best to tackle the easy case first, though.
More often than not, there will be spelling differences in the names between the two datasets.
Currently we are using IBM's QualityStage to do this linking, which uses the "MULT_UNCERT" comparison function (https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ds.design.help.doc/topics/r_qresfgde_MULT_UNCERT_comparison.html). I want to replicate this in R.

Original Q&A

There are 2 answers

Maharero On 11 February 2019 at 01:22

Addition I made to Khaynes's answer, as outlined in the comment:

library(gtools)

...

# Store the split up components of each comparison variable.
split1 <- strsplit(block_pairs[["lname.x"]], split)
split2 <- strsplit(block_pairs[["lname.y"]], split)

# Recombine tokens into all possible orderings:
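make_combinations <- function(x) {
      # Use permutations() from the gtools package; set = FALSE keeps repeated
      # tokens (e.g. "de la de") instead of stopping with an error
      split_names <- permutations(length(x), length(x), x, set = FALSE)
      # Paste each row (one ordering of the tokens) back into a single string
      apply(X = split_names, MARGIN = 1, FUN = paste, collapse = " ")
}

split1 <- lapply(X = split1, FUN = make_combinations)
split2 <- lapply(X = split2, FUN = make_combinations)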

# Perform jarowinkler comparisons on each string combination and append it to the table
block_pairs[ ,("winkler.lname") := mapply(function(x, y) max(outer(x, y, jarowinkler)), split1, split2)]

# Sort by the jarowinkler score
block_pairs <- block_pairs[order(winkler.lname)]

# 0.85 is an appropriate threshold in this instance
block_pairs <- block_pairs[winkler.lname >= 0.85]


         bday             lname.x      lname.y winkler.lname
1: 1908-08-08 lolli gaggen nazeem        lolli     0.8526316
2: 1560-06-06 lolli gaggen nazeem       nazeem     0.8631579
3: 1979-12-12 lolli gaggen nazeem       gaggen     0.8631579
4: 1979-12-12          matt dowle       m dowl     0.9200000
5: 1560-06-06          john-smith johnny smith     0.9666667
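
The permutation approach above generates every ordering of the tokens (n! strings for an n-word name), which can get expensive on millions of rows. A lighter alternative, sketched here under stated assumptions, scores tokens directly: each word of the shorter name is matched to its best counterpart in the longer name and the best-match scores are averaged, so word order never matters. `lev_sim` and `token_score` are hypothetical helper names, and a normalised Levenshtein similarity from base R's `adist()` stands in for `jarowinkler()` so that the sketch runs without extra packages. It is not the MULT_UNCERT algorithm.

```r
# Normalised Levenshtein similarity in [0, 1]; a base-R stand-in for jarowinkler().
# Returns a matrix with one row per element of a and one column per element of b.
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)                      # edit distances, length(a) x length(b)
  longest <- outer(nchar(a), nchar(b), pmax)   # longer string length for each pair
  1 - d / pmax(longest, 1)                     # pmax(., 1) avoids division by zero
}

# Hypothetical helper: score two multi-word names regardless of word order.
# Each token of the shorter name is matched to its best counterpart in the
# longer name, and those best-match scores are averaged.
token_score <- function(x, y, split = " |-") {
  tx <- strsplit(x, split)[[1]]
  ty <- strsplit(y, split)[[1]]
  if (length(tx) > length(ty)) {               # make tx the shorter token list
    tmp <- tx; tx <- ty; ty <- tmp
  }
  sims <- lev_sim(tx, ty)                      # rows: shorter name, cols: longer name
  mean(apply(sims, 1, max))
}

token_score("lolli gaggen nazeem", "nazeem")        # 1: the token occurs in the longer name
token_score("lolli gaggen nazeem", "nazeem lolli")  # 1: two tokens, different order
token_score("lolli gaggen nazeem", "smith")         # low: no token resembles "smith"
```

To use Jaro-Winkler instead, the body of `lev_sim` could presumably be swapped for `outer(a, b, jarowinkler)`, which is the same call the accepted answer uses.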

**Khaynes** · Accepted Answer · 2018-12-13T22:22:17+00:00

Have you thought about the following approach?

Record linkage on names is, as I'm sure you know, difficult. Ideally you want to block on other available information (gender, unique identifiers, date of birth, location information, etc.) and then do string comparisons on the names.

You mention large datasets with millions of records. Look no further than the data.table package by the great Matt Dowle (https://stackoverflow.com/users/403310/matt-dowle).

The RecordLinkage package is slow in comparison. You could easily extend the code below with phonetic encoding techniques such as Soundex, Double Metaphone, NYSIIS, etc.

# install.packages("data.table")
library(RecordLinkage)
library(data.table)

# stringsAsFactors = FALSE keeps the names as character vectors, which
# strsplit() needs (R versions before 4.0 would otherwise create factors)
data1 <- data.frame(lname = c("lolli gaggen nazeeem", "lolli gaggen nazeem", "lollly gaggen nazeem", "matt dowle", "john-smith"),
                    bday = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06"),
                    stringsAsFactors = FALSE)

data2 <- data.frame(lname = c("lolli", "gaggen", "nazeem", "m dowl", "johnny smith"),
                    bday = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06"),
                    stringsAsFactors = FALSE)


# Coerce to data.tables
setDT(data1)
setDT(data2)

# Define a regex split (we will split all words based on space or hyphen)
split <- " |-"

# Apply a blocking strategy based on bday. Ideally your dataset would allow for additional blocking strategies(?).
block_pairs <- merge(data1, data2, by = "bday", all = TRUE,
            sort = TRUE, suffixes = c(".x", ".y"))

# Store the split up components of each comparison variable.
split1 <- strsplit(block_pairs[["lname.x"]], split)
split2 <- strsplit(block_pairs[["lname.y"]], split)

# Perform jarowinkler comparisons on the full strings (fc) and on every
# pairing of their components (pc), keeping the best component score per row
fc <- jarowinkler(block_pairs[["lname.x"]], block_pairs[["lname.y"]])
pc <- mapply(function(x, y) max(outer(x, y, jarowinkler)), split1, split2)

# Store the max of the full and partial comparisons
block_pairs[, winkler.lname := pmax(fc, pc)]


# Sort by the jarowinkler score
block_pairs <- block_pairs[order(winkler.lname)]

# Inspect
block_pairs

# 0.96 is an appropriate threshold in this instance
block_pairs <- block_pairs[winkler.lname >= 0.96]
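
The phonetic-encoding idea mentioned in this answer can be illustrated with a small sketch. RecordLinkage provides its own `soundex()` function; the version here is a simplified, base-R Soundex-style key (it ignores the special H/W rule of the full standard) written only so the example runs without extra packages. `soundex_key` and `shares_key` are hypothetical names. Such keys could serve as an additional, spelling-tolerant blocking field or as a cheap pre-filter before the Jaro-Winkler comparison.

```r
# Simplified Soundex-style key: first letter plus up to three consonant codes.
# Assumption: this is an illustration, not the full Soundex standard.
soundex_key <- function(word) {
  w <- toupper(gsub("[^A-Za-z]", "", word))
  if (!nzchar(w)) return(NA_character_)
  chars <- strsplit(w, "")[[1]]
  # Map consonants to their Soundex digit classes
  codes <- chartr("BFPVCGJKQSXZDTLMNR", "111122222222334556", chars)
  codes[!codes %in% as.character(1:6)] <- "0"          # vowels, H, W, Y
  keep <- c(TRUE, codes[-1] != codes[-length(codes)])  # collapse adjacent repeats
  codes <- codes[keep]
  tail_codes <- codes[-1]                              # first letter is kept as a letter
  tail_codes <- tail_codes[tail_codes != "0"]          # drop vowel placeholders
  substr(paste0(chars[1], paste(tail_codes, collapse = ""), "000"), 1, 4)
}

# Hypothetical helper: TRUE if any token of x shares a phonetic key with any token of y.
shares_key <- function(x, y, split = " |-") {
  kx <- vapply(strsplit(x, split)[[1]], soundex_key, character(1))
  ky <- vapply(strsplit(y, split)[[1]], soundex_key, character(1))
  kx <- kx[!is.na(kx)]                                 # ignore empty tokens
  ky <- ky[!is.na(ky)]
  any(kx %in% ky)
}

soundex_key("nazeem")                           # same key as "nazheem"
shares_key("lolli gaggen nazheem", "nazeem")    # TRUE despite the spelling difference
shares_key("lolli gaggen nazeem", "smith")      # FALSE
```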
