I have two dataframes, `df1` and `df2`, that have information about polling stations. The dataframes are of different lengths. Both dataframes have a column called `ps_name`, which is the name of the polling station, and a column called `district` that indicates which district the polling station is located in.
I am trying to match strings on the `ps_name` column while blocking on the `district` column, so I can copy a `geolocations` (latitude and longitude) column on matches from `df1` to `df2`.
So far I've tried using Jaro-Winkler at threshold `0.88` to compare strings.
# Matched:
**df1:** AGRICULTURAL OFFICE ATTOCK (MALE) I (P)
**df2:** AGRICULTURAL OFFICE ATTOCK (MALE) (P)
# Did not match:
**df1:** govt girls high school peoples colony attock ii
**df2:** high school peoples colony attock ii
What string distance algorithm should I be using? I've tried Jaro-Winkler and was also considering Smith-Waterman.
One option is to use Levenshtein distance, which is implemented in the `fuzzywuzzy` package. The algorithm runs in O(n + d^2), where n is the length of the longer string and d is the edit distance.
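To make the idea concrete without any dependencies, here is a minimal pure-Python sketch of Levenshtein matching with blocking on district. It is not how `fuzzywuzzy` works internally, and it uses the textbook dynamic-programming recurrence, which is O(n·m) rather than the O(n + d^2) variant. The helper names (`levenshtein`, `similarity`, `best_match`), the list-of-dicts layout, and the `threshold` default are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] after lowercasing; 1.0 means identical."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(name, district, candidates, threshold=0.8):
    """Hypothetical helper: best candidate in the same district, else None.

    `candidates` is assumed to be a list of dicts with "ps_name" and
    "district" keys, e.g. the result of df1.to_dict("records").
    """
    # Blocking: only compare against rows from the same district.
    scored = [(similarity(name, c["ps_name"]), c)
              for c in candidates if c["district"] == district]
    if not scored:
        return None
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None
```

Under these assumptions, a row returned by `best_match` would be the `df1` record whose `geolocations` value gets copied to the `df2` row. Note that the unmatched pair from the question differs by an 11-character prefix ("govt girls "), so a plain edit-distance ratio penalizes it heavily and the threshold would need tuning.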
Example: