I have a vector of comments and another vector of names that I am trying to find in the comments. My approach is to use fuzzywuzzy and, as a first step, assign each name the score it gets for the corresponding comment, so that in a further step I can tell which names were likely mentioned in which comment.
For instance, the data could look like this:
# in the original data both would be DFs with more than 1 column
import pandas as pd
from fuzzywuzzy import fuzz
names = pd.DataFrame({"names": ["Anna Starkow", "Marian Mueller", "James Leo Arneau"]})
# spelling mistakes and abbreviations
Comments = pd.DataFrame({"Comments": ["Ana Starkov was super good!", "M. mueller is in great shape", "I like Arneau and Starkow's work", "Thanks Anna Starkov, credits to JL Arneau"]})
# my approach:
# This is also from stack
N = len(Comments.index)
M = len(names.index)
res = pd.DataFrame([[0] * M] * N)
# My code
DF = pd.concat([Comments.reset_index(drop=True), res], axis=1)
for x in range(len(Comments.index)):
    for i in range(len(names.index)):
        DF.iloc[x, i + 1] = fuzz.token_set_ratio(DF.Comments[x], names.names[i])
However, it ran forever and didn't come back with results.
I'd expect something like this to come back:
# Comments                                      Anna Starkow    Marian Mueller    ....
# Ana Starkov was super good!                   80              0                 ....
# M. mueller is in great shape                  0               70                ....
# ....                                          ....            ....              ....
Is there a more efficient way to do this?
I hope there are no errors in the code, because I had to type it over from another machine where I'm not permitted to use Stack Overflow.
The other answers concentrate on ways to speed up assigning the results. While this might slightly help with performance, your real issue is that the string matching with fuzzywuzzy is really slow. There are two parts you could optimise:
When passing two strings to fuzz.token_set_ratio, it preprocesses them by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're iterating over the names multiple times, you're repeating this work.
Even when using python-Levenshtein, FuzzyWuzzy is not really optimised in a lot of places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface, but is mostly implemented in C++ and comes with some additional algorithmic improvements that make it a lot faster.
In case you're fine with bad matches returning a score of 0, you can further improve the performance by passing a score_cutoff:
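Here is a minimal sketch of how that could look, assuming RapidFuzz 2.x or later and the example frames from the question; the score_cutoff of 80 is an arbitrary threshold you would tune yourself:

import pandas as pd
from rapidfuzz import fuzz, utils

names = pd.DataFrame({"names": ["Anna Starkow", "Marian Mueller", "James Leo Arneau"]})
Comments = pd.DataFrame({"Comments": ["Ana Starkov was super good!", "M. mueller is in great shape", "I like Arneau and Starkow's work", "Thanks Anna Starkov, credits to JL Arneau"]})

# preprocess the names a single time instead of once per comparison
proc_names = [utils.default_process(name) for name in names["names"]]

scores = []
for comment in Comments["Comments"]:
    # preprocess each comment once, then score it against every preprocessed name;
    # anything below score_cutoff comes back as 0
    proc_comment = utils.default_process(comment)
    scores.append([fuzz.token_set_ratio(proc_comment, name, score_cutoff=80)
                   for name in proc_names])

res = pd.concat([Comments, pd.DataFrame(scores, columns=names["names"])], axis=1)
print(res)

Newer RapidFuzz versions also ship process.cdist, which computes the whole Comments x names score matrix in a single call (optionally in parallel via its workers argument) and avoids the Python-level loop entirely.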