Applying Jaro-Winkler distance to two dataframes


I have two dataframes of unequal length and would like to compare each string in df2 against the strings in df1. Is it possible to apply the Jaro-Winkler distance to calculate string similarity across the two dataframes, for example through a map/lambda function?

df1
Behavioral disorders
Behçet disease
AV-Block

df2
Behavioral disorder
Behçet syndrome

The desired output is:

name_left                 name_right            score   
Behavioral disorders      Behavioral disorder   0.933333
Behçet disease            Behçet syndrome       0.865342

The scores shown above are hypothetical. Any help is highly appreciated.

There is 1 answer

Answered by mozway

Assuming that, for each name in df2, you want the match in df1 with the highest score, and that the column in both inputs is called "name":

# pip install jaro-winkler
# https://pypi.org/project/jaro-winkler/
import pandas as pd
from jaro import jaro_winkler_metric as jw

# For each name in df2, score it against every name in df1
# and keep the (name_left, score) pair with the highest score.
pd.DataFrame([[n2, *max([(n1, jw(n1, n2)) for n1 in df1['name']],
                        key=lambda x: x[1])]
              for n2 in df2['name']],
              index=df2.index,
              columns=['name_right', 'name_left', 'score']
            )[['name_left', 'name_right', 'score']]
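
If you would rather not depend on an extra package, the same "best match per row" idea can be sketched in plain Python. This is a minimal sketch, not the jaro-winkler package's implementation: the helper names (jaro, jaro_winkler, best_matches) are made up for illustration, the prefix weight p=0.1 and prefix cap of 4 are the commonly used defaults, and the scores may differ slightly from what a given library returns (some variants only apply the prefix boost above a threshold).

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] (illustrative implementation)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def best_matches(left, right):
    """For each string in `right`, return (best_left, right, score)."""
    rows = []
    for r in right:
        best, score = max(((cand, jaro_winkler(cand, r)) for cand in left),
                          key=lambda pair: pair[1])
        rows.append((best, r, score))
    return rows


left = ["Behavioral disorders", "Behçet disease", "AV-Block"]
right = ["Behavioral disorder", "Behçet syndrome"]
rows = best_matches(left, right)
for name_left, name_right, score in rows:
    print(f"{name_left!r:26} {name_right!r:24} {score:.4f}")
```

With dataframes you could pass df1['name'] and df2['name'] instead of the lists, then build the result with pd.DataFrame(rows, columns=['name_left', 'name_right', 'score']). Note that this compares every pair, so the cost grows with len(df1) * len(df2).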