I have a text column in df1 and a text column in df2. The two DataFrames have different lengths. I want to calculate the cosine similarity of every entry in df1[text] against every entry in df2[text] and get a score for every pair.
sample input
df1
mahesh
suresh
df2
surendra
mahesh
shrivatsa
suresh
maheshwari
sample output
mahesh surendra 30
mahesh mahesh 100
mahesh shrivatsa 20
mahesh suresh 60
mahesh maheshwari 80
suresh surendra 70
suresh mahesh 60
suresh shrivatsa 40
suresh suresh 100
suresh maheshwari 30
I was getting key errors when I tried to match these two columns for similarity using a TF-IDF approach, because the columns are of different lengths. Is there any other way to solve this problem? Any help would be greatly appreciated. I have searched a lot and found that in almost all cases people compare the first document against the rest of the documents in the same corpus. Here, it is a matter of comparing every document in corpus 1 with every document in corpus 2.
There are many different string distance measures. I can't be sure how to use cosine similarity for this case, though I suggest looking into the strsim library. I'll give you an example of how I would approach the issue using the Jaro-Winkler metric, which is best suited for short strings. I'm also including my attempt to use cosine similarity, based on the example from the documentation of that library. It could be completely wrong, but it should give you a general idea of how to make a DataFrame from the cartesian product of two columns of different lengths, as well as how to apply strsim's algorithms to the data stored in a pd.DataFrame.
Data preparation:
returns:
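A minimal sketch of the data-preparation step, assuming pandas and the sample names from the question. The column names text_1 and text_2, the variable pairs, and the temporary key column are illustrative choices, not taken from the original answer. Merging on a constant key pairs every row of df1 with every row of df2, so the differing lengths do not matter.

```python
import pandas as pd

# Sample data from the question; the column names are assumptions.
df1 = pd.DataFrame({"text_1": ["mahesh", "suresh"]})
df2 = pd.DataFrame(
    {"text_2": ["surendra", "mahesh", "shrivatsa", "suresh", "maheshwari"]}
)

# Cartesian product via a constant helper key: every df1 row is matched
# with every df2 row. On pandas >= 1.2, df1.merge(df2, how="cross")
# should give the same pairs without the helper column.
pairs = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop(columns="key")
)

print(pairs)
```

The result has len(df1) * len(df2) rows, one per pair, which is the shape the sample output in the question asks for.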
Jaro-Winkler:
returns:
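For illustration, here is a plain-Python sketch of the Jaro-Winkler similarity applied to every pair. It is a hand-written stand-in and not the strsim implementation, so scores may differ slightly from the library's; for instance, this version applies the common-prefix bonus unconditionally, while some implementations only apply it above a threshold. The function names jaro and jaro_winkler and the 0 to 100 scaling are assumptions made for this example.

```python
from itertools import product


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


names_1 = ["mahesh", "suresh"]
names_2 = ["surendra", "mahesh", "shrivatsa", "suresh", "maheshwari"]

# One score per (df1 entry, df2 entry) pair, scaled to 0-100.
jw_scores = [
    (a, b, round(100 * jaro_winkler(a, b))) for a, b in product(names_1, names_2)
]
for row in jw_scores:
    print(*row)
```

With a DataFrame of pairs, the same function could be applied row by row, for example through DataFrame.apply with axis=1.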
Cosine similarity:
returns:
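As a rough sketch of the cosine approach, each string can be turned into a profile of character bigram counts, and the cosine of the angle between two profiles gives the score. This is a plain-Python illustration of the idea and not strsim's API; the helper names ngram_profile and cosine_similarity, the bigram size, and the 0 to 100 scaling are assumptions. The scores will not match the sample output in the question, which does not say how its numbers were produced.

```python
from collections import Counter
from itertools import product
from math import sqrt


def ngram_profile(text, n=2):
    """Count the overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the n-gram count vectors of a and b."""
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(
        sum(v * v for v in pb.values())
    )
    # Strings shorter than n have an empty profile; score them as 0.
    return dot / norm if norm else 0.0


names_1 = ["mahesh", "suresh"]
names_2 = ["surendra", "mahesh", "shrivatsa", "suresh", "maheshwari"]

# Every entry of the first list against every entry of the second.
cosine_scores = [
    (a, b, round(100 * cosine_similarity(a, b)))
    for a, b in product(names_1, names_2)
]
for row in cosine_scores:
    print(*row)
```

Because each pair is scored independently, the two columns never need to share an index or have the same length, which avoids the key errors described in the question.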