Is there a way to use functions and not nested loops to improve the run time of the following process?


My problem is that I have one vector of comments and another vector of names that I am trying to find in those comments. My approach uses fuzzywuzzy: as a first step, I assign each name the score it gets for each comment, so that in a further step I can say which names were likely mentioned in which comment.

For instance, the data could look like this:

# in the original data both would be DFs with more than 1 column
names = pd.DataFrame(["Anna Starkow", "Marian Mueller", "James Leo Arneau"])

# spelling mistakes and abbreviations
Comments=pd.DataFrame(["Ana Starkov was super good!", "M. mueller is in great shape", "I like Arneau and Starkow's work","Thanks Anna Starkov, credits to JL Arneau"])

# my approach:

# This is also from stack
N = len(Comments.index)
M = len(names.index)

res= pd.DataFrame([[0] * M]*N)

#My code
DF=pd.concat([Comments.reset_index(drop=True), res],axis=1)


for x in range(len(Comments.index)):
    for i in range(len(names.index)):
        DF[x, i+1]=fuzz.token_set_ratio(DF.Comments[x],names.names[i])

However, it ran forever and didn't come back with results.

I'd expect something like this to come back:

# Comments                         Anna Starkow  Marian Mueller ....
# Ana Starkov was super good!               80               0  ....
# M. mueller is in great shape               0              70  ....
# ....                                    ....            ....  ....

Is there a more efficient way to do this?

I hope I have no error in the code because I had to type it from another machine where I'm not permitted to use Stack.

3

There are 3 answers

2
maxbachmann On BEST ANSWER

The other answers concentrate on ways to speed up assigning the results. While this might slightly help with performance, your real issue is that the string matching with fuzzywuzzy is really slow. There are two parts you could optimise:

  1. When you pass two strings to fuzz.token_sort_ratio, it preprocesses them by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're iterating over names multiple times, you're repeating this work for every comment.

  2. Even when using python-Levenshtein, FuzzyWuzzy is not really optimised in a lot of places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface, but is mostly implemented in C++ and comes with some additional algorithmic improvements that make it a lot faster.

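from rapidfuzz import fuzz, utils

# Preprocess every name once, outside the loops
processed_names = [utils.default_process(name) for name in names.names]

for x in range(len(Comments.index)):
    # Preprocess each comment once rather than once per name
    comment = utils.default_process(DF.Comments[x])
    for i, name in enumerate(processed_names):
        # processor=None skips the scorer's own preprocessing step
        DF.iloc[x, i + 1] = fuzz.token_sort_ratio(comment, name, processor=None)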

If you're fine with bad matches returning a score of 0, you could further improve the performance by passing a score_cutoff:

fuzz.token_sort_ratio(
  utils.default_process(DF.Comments[x]), name,
  processor=None, score_cutoff=<insert your minimum score here>)
0
Sarah Messer On

The issue here isn't Python so much as DataFrames, which work similarly to SQL tables whether you're dealing with Pandas or PySpark: Whenever possible, you should vectorize operations on the DF. This lets the computer worry about parallelizing the algorithm.

If you have a pre-existing DF, you can efficiently apply a function along its rows or columns using pandas.DataFrame.apply().

In your case, it looks more like you're just looking for a better way to initialize the DF. If you can describe your DF contents as a list of dictionaries (with one dict per record), I recommend using pandas.DataFrame.from_records(). Each dict in the list will have a form like {'Comments': 'Ana Starkov was super good!', 'Anna Starkow': 80, 'Marian Mueller': 0}. (The from_dict() method is similar in concept, but has a slightly different input format.)

This will be significantly faster than building / rewriting a DF cell-by-cell.
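
A minimal sketch of the from_records() approach described above, under some assumptions: the data is held in plain lists, and score is a hypothetical stand-in scorer built on the standard library's difflib so the example stays dependency-light. In practice you would swap in whichever fuzzy scorer you actually use, such as fuzz.token_set_ratio.

```python
from difflib import SequenceMatcher

import pandas as pd

names = ["Anna Starkow", "Marian Mueller", "James Leo Arneau"]
comments = [
    "Ana Starkov was super good!",
    "M. mueller is in great shape",
]


def score(comment: str, name: str) -> float:
    """Hypothetical stand-in similarity score in [0, 100]."""
    return 100 * SequenceMatcher(None, comment.lower(), name.lower()).ratio()


# One dict per comment: the comment text plus one score per name
records = [
    {"Comments": comment, **{name: score(comment, name) for name in names}}
    for comment in comments
]

# Build the whole frame in a single call instead of writing cell by cell
result = pd.DataFrame.from_records(records)
print(result)
```

Note that this only changes how the DataFrame is assembled; the number of scorer calls is the same as in the nested loops.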

0
Roei Levy On

You can try the following:

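# Collect one row of scores per comment in plain lists
scores = []
for comment in Comments[0]:
    row = []
    for name in names[0]:
        row.append(fuzz.token_sort_ratio(comment, name))
    scores.append(row)

# Build the score columns once and attach them to the comments
DF_out = pd.concat([Comments.reset_index(drop=True), pd.DataFrame(scores)], axis=1)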