R: Using plyr to perform fuzzy string matching between matching subsets of two data sources


Say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create fips dataframe below), but the states in which the misspelled counties reside are entered correctly. Here's a sample of 21 random observations from my full dataset:

tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
                                     "Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline", 
                                     "Richmond City", "Saint Mary", "Saint Louis City", "Mclean", 
                                     "Union", "Bienville", "Covington City", "Martinsville City", 
                                     "Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA", 
             "LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN", 
             "MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
             ), row.names = c(NA, -21L))

              county state
1         Beauregard    LA
2            De Soto    LA
3             Dekalb    GA
4            Webster    LA
5       Saint Joseph    IN
6     West Feliciana    LA
7  Ketchikan Gateway    AK
8         Evangeline    LA
9      Richmond City    VA
10        Saint Mary    LA
11  Saint Louis City    MO
12            Mclean    KY
13             Union    LA
14         Bienville    LA
15    Covington City    VA
16 Martinsville City    VA
17         Claiborne    LA
18    King And Queen    VA
19            Mclean    ND
20            Mcminn    TN
21    Prince Georges    MD

I've used adist to create a fuzzy string matching algorithm that matches around 80% of my counties to the county names in fips. However, sometimes it will match two counties with similar spelling but from different states (e.g., "Webster, LA" gets matched to "Webster, GA" rather than "Webster Parish, LA").

distance <- adist(tomatch$county, 
                  fips$countyname, 
                  partial = TRUE)


# Minimum edit distance achieved by each row of 'tomatch'
min.name <- apply(distance, 1, min)

matchedcounties <- NULL

for (i in 1:nrow(distance)) {

  # Column index of the (first) fips entry that achieves that minimum
  s2.i <- match(min.name[i], distance[i, ])
  s1.i <- i

  matchedcounties <- rbind(data.frame(s2.i = s2.i,
                                      s1.i = s1.i,
                                      s1name = tomatch[s1.i, ]$county,
                                      s2name = fips[s2.i, ]$countyname,
                                      adist = min.name[i]),
                           matchedcounties)

}

Therefore, I want to restrict the fuzzy string matching of each county to the correctly spelled versions in the matching state.

My current algorithm makes one big matrix which calculates standard Levenshtein distances between both sources and then selects the value with the minimum distance.

To solve my problem, I'm guessing I'd need to create a function that could be applied to each 'state' group by ddply, but I'm confused as to how I should indicate that the group value in the ddply function should match another dataframe. A dplyr solution or solution using any other package would be appreciated as well.
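For reference, the closest I've gotten on my own is a rough, untested sketch of a per-state helper along these lines, though I'm not sure it's the right way to hand the matching state's subset of fips to each group:

library(plyr)

# Rough sketch (untested): for each state's group of rows, only compare
# against the FIPS county names from that same state.
match_in_state <- function(grp) {
  candidates <- fips[fips$state == grp$state[1], ]
  d <- adist(grp$county, candidates$countyname, partial = TRUE)
  grp$matched <- candidates$countyname[apply(d, 1, which.min)]
  grp$dist <- apply(d, 1, min)
  grp
}

matched <- ddply(tomatch, "state", match_in_state)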

Code to create FIPS dataset:

download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
              './nationalfips.txt')

fips <- read.csv('./nationalfips.txt', 
                 stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')

# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)

There are 3 answers

Answer from MattV (BEST ANSWER)

Here's a way with dplyr. I first join the tomatch data.frame with the FIPS names by state (allowing only in-state matches):

require(dplyr)
df <- tomatch %>% 
  left_join(fips, by="state")

Next, I noticed that a lot of county names use 'St.' rather than 'Saint' in the FIPS dataset. Cleaning that up first should improve the results.

df <- df %>%
    mutate(county_clean = gsub("Saint", "St.", county))

Then, group this data.frame by county, and calculate the distance with adist:

df <- df %>%
  group_by(county_clean) %>%                # Calculate the distance per county
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
  arrange(county, dist) # Used this for visual inspection.

Note that I took the diagonal from the resulting matrix as adist returns an n x m matrix with n representing the x vector and m representing the y vector (it calculates all of the combinations). Optionally, you could add the agrep result:

df <- df %>%
  rowwise() %>% # 'group_by' a single row. 
  mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup()   # Always a good idea to remove 'groups' after you're done.

Then filter as you did before, take the minimum distance:

df <- df %>%
  group_by(county_clean) %>%   # Causes it to calculate the 'min' per group
  filter(dist == min(dist)) %>%
  ungroup()

Note that this could result in more than one row returned for each of the input rows in tomatch.
Alternatively, do it all in one run (I usually change code to this format once I'm confident it's doing what it's supposed to do):

df <- tomatch %>% 
  # Join on all names in the relevant state and clean 'St.'
  left_join(fips, by="state") %>%
  mutate(county_clean = gsub("Saint", "St.", county)) %>% 

  # Calculate the distances, per original county name.
  group_by(county_clean) %>%                
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%

  # Append the agrepl result
  rowwise() %>%
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() %>%  

  # Only retain minimum distances
  group_by(county_clean) %>%   
  filter(dist == min(dist))

The result in both cases:

              county      county_clean state                countyname dist string_agrep
1         Beauregard        Beauregard    LA         Beauregard Parish    0         TRUE
2            De Soto           De Soto    LA            De Soto Parish    0         TRUE
3             Dekalb            Dekalb    GA                    DeKalb    1         TRUE
4            Webster           Webster    LA            Webster Parish    0         TRUE
5       Saint Joseph        St. Joseph    IN                St. Joseph    0         TRUE
6     West Feliciana    West Feliciana    LA     West Feliciana Parish    0         TRUE
7  Ketchikan Gateway Ketchikan Gateway    AK Ketchikan Gateway Borough    0         TRUE
8         Evangeline        Evangeline    LA         Evangeline Parish    0         TRUE
9      Richmond City     Richmond City    VA             Richmond city    1         TRUE
10        Saint Mary          St. Mary    LA           St. Mary Parish    0         TRUE
11  Saint Louis City    St. Louis City    MO            St. Louis city    1         TRUE
12            Mclean            Mclean    KY                    McLean    1         TRUE
13             Union             Union    LA              Union Parish    0         TRUE
14         Bienville         Bienville    LA          Bienville Parish    0         TRUE
15    Covington City    Covington City    VA            Covington city    1         TRUE
16 Martinsville City Martinsville City    VA         Martinsville city    1         TRUE
17         Claiborne         Claiborne    LA          Claiborne Parish    0         TRUE
18    King And Queen    King And Queen    VA            King and Queen    1         TRUE
19            Mclean            Mclean    ND                    McLean    1         TRUE
20            Mcminn            Mcminn    TN                    McMinn    1         TRUE
21    Prince Georges    Prince Georges    MD           Prince George's    1         TRUE
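
As noted above, the minimum-distance filter can keep more than one row per input county when there are ties. If you need exactly one row each, a minimal sketch (with a hypothetical, arbitrary tie-break: just keep the first match within each group) would be:

df_unique <- df %>%
  group_by(county_clean, state) %>%  # one group per original input row
  slice(1) %>%                       # arbitrary tie-break: keep the first match
  ungroup()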
Answer from cole

I don't have your example data, but try something like the following, using agrep instead of adist and searching only the names in that state:

sapply(seq_len(nrow(df_tomatch)), function(i)
  agrep(df_tomatch$county[i], df_matchby$county[df_matchby$state == df_tomatch$state[i]], value = TRUE))

You can use the max.distance argument in agrep to vary how close they need to match. Also, setting value=TRUE returns the value of the matched string rather than the location of the match.
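
For instance, a small standalone illustration of those two arguments (the strings here are made up for the example, not taken from the question's data):

# value = TRUE returns the matched strings themselves; max.distance (here a
# fraction of the pattern length) controls how approximate the match may be.
agrep("Webster", c("Webster Parish", "Union Parish"),
      max.distance = 0.2, value = TRUE)
# expected to return "Webster Parish" only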

Answer from beniaminogreen

My R package, zoomerjoin, provides this functionality in the lsh_join family of functions, which allow you to fuzzy match while specifying a variable to block on. The syntax is pretty similar to the dplyr family of joins, with a few extra hyperparameters. To solve the problem as you describe it, you can use the following code:

jaccard_inner_join(tomatch, tomatch,
                   by = "county",
                   block_by = "state",
                   n_gram_width = 4,
                   n_bands = 200)

#  A tibble: 21 × 4
#   county.x          state.x county.y          state.y
#   <chr>             <chr>   <chr>             <chr>
# 1 Prince Georges    MD      Prince Georges    MD
# 2 Richmond City     VA      Richmond City     VA
# 3 Mclean            ND      Mclean            ND
# 4 Dekalb            GA      Dekalb            GA
# 5 West Feliciana    LA      West Feliciana    LA
# 6 Ketchikan Gateway AK      Ketchikan Gateway AK
# 7 Covington City    VA      Covington City    VA
# 8 De Soto           LA      De Soto           LA
# 9 Bienville         LA      Bienville         LA
#10 Saint Mary        LA      Saint Mary        LA
# … with 11 more rows
# ℹ Use `print(n = ...)` to see more rows

In this example the 'by' argument works exactly as it does in dplyr, while the 'block_by' argument indicates that you only want fuzzy matches within a given state. The extra hyperparameter n_bands specifies the number of LSH bands (roughly, search iterations) used to look for matches (the package uses a Locality-Sensitive Hashing algorithm, which allows it to scale to datasets with hundreds of millions of rows).
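
If you wanted to match against the FIPS table itself rather than self-join, a hypothetical sketch under a few assumptions: the fips data frame has been cleaned as in the question, countyname is renamed so both tables share the join column, and the threshold argument (the Jaccard similarity cutoff, assumed to be available in your version of the package) is lowered because the FIPS names carry 'Parish'/'County'/'Borough' suffixes; the value used here is only a guess:

library(dplyr)
library(zoomerjoin)

# Hypothetical: align the column names, then block on state as above.
fips_to_join <- fips %>% rename(county = countyname)

jaccard_inner_join(tomatch, fips_to_join,
                   by = "county",
                   block_by = "state",
                   n_gram_width = 4,
                   n_bands = 200,
                   threshold = 0.3)  # assumed argument; tune for your data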

The package is still in development, so please let me know if you have any issues with the code or documentation.