I have two dataframes as csv files, where df1 has more rows than df2:
Df1
Name Count
xxx yyyyyy bbb cccc 15
fffdd 444 ggg 20
kkbbb ccc dd 29p 5
22 cc pbc2 kmn3 b23 efgh 4
ccccccccc sss qqqq 2
Df2
Name
xxx yyyyyy bbb cccc
ccccccccc sss qqqq pppc
22 cc pbc2 kmn3 b23,efgh
I want to do partial (approximate/fuzzy) matching on either the first two or the first three words of each name. The output should look like this:
Output:
Name Count
xxx yyyyyy bbb cccc 15
22 cc pbc2 kmn3 b23 efgh 4
ccccccccc sss qqqq 2
With exact matching, I miss some of the rows. I tried agrep in R, but somehow it's not working, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!
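Since Python is also an option, here is a minimal sketch of the "first two words" matching described in the question. It is not edit-distance fuzzy matching: it normalizes each name (lowercase, commas treated as spaces) and compares only the leading words, which keeps it fast because the lookup is a set membership test. The helper name `key`, the `n_words` parameter, and the inline lists standing in for the two csv files are assumptions for illustration.

```python
import re

def key(name, n_words=2):
    """Normalize a name and keep its first n_words tokens as a match key."""
    # Lowercase, then split on whitespace or commas so "b23,efgh" == "b23 efgh".
    tokens = re.split(r"[\s,]+", name.strip().lower())
    return tuple(tokens[:n_words])

# Stand-ins for the two csv files (values copied from the question).
df1_rows = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2_names = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]

# Build the set of keys from df2 once, then filter df1 against it.
wanted = {key(name) for name in df2_names}
matched = [(name, count) for name, count in df1_rows if key(name) in wanted]

for name, count in matched:
    print(name, count)
```

Passing `n_words=3` to `key` in both places matches on the first three words instead. Names that differ inside the leading words (for example, a typo in the first word) would not be caught by this sketch and would still need a real fuzzy matcher.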
In R, you can use agrep for fuzzy matching. You can use the max.distance parameter to set the maximum distance allowed for a match.
The data: