How to speed up word removal in a dataframe of word lists?


I am trying to remove non-dictionary words from a medium-sized (18k rows) pandas dataframe, but my approach is extremely slow. Basically, I apply a list comprehension to every row of the dataframe. This works, but I have not managed to vectorize the process. How can I speed it up?

My approach also seems to affect multiple dataframes at once, which I do not want to happen. How can I fix that?

The code below represents the approach I have taken, just with significantly less data:

import pandas as pd

df = pd.DataFrame({'Text': [['this', 'is', 'an', 'apple'], 
                            ['this', 'is', 'a', 'carrot'], 
                            ['this', 'is', 'a', 'steak']],
                   'Class': ['fruit', 'vegetable', 'meat']})

valid_words = ['apple', 'carrot', 'steak']

def dictwords(text):
  valid_text = [word for word in text if word in valid_words]
  return valid_text

clean = df

clean['Text'] = clean['Text'].apply(dictwords)

This works, but it is far too slow for my actual data. The real dataset has about 60k unique words - both valid and invalid - and I am trying to keep only about 30k of them. There are about 18k rows of text. As one would probably expect, using .apply() for this takes an extremely long time.

I have tried njit/jit for parallelization but without much luck. What are some vectorization/parallelization techniques I can try for this data, and are there any better ways to do this than list comprehension?
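As a point of comparison, one small change that already helps a lot is turning valid_words into a set, so each membership check is O(1) instead of a scan over a ~30k-element list. A rough sketch (reusing df and valid_words from the snippet above; not a full vectorization):

valid_set = set(valid_words)          # hashed lookups instead of list scans

def dictwords_set(text):
    # same list comprehension, but membership is checked against the set
    return [word for word in text if word in valid_set]

clean = df.copy()                     # explicit copy so df itself is untouched
clean['Text'] = clean['Text'].apply(dictwords_set)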

Also, I found that when I applied dictwords() to the clean dataset, it also seemed to apply it to df. I'm not sure why this is the case or how to prevent this, so any explanation for this would be helpful as well. It seems to happen in all the Jupyter Notebook platforms I've tested.
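For what it's worth, clean = df does not create a new DataFrame; it only binds a second name to the same object, so writing to clean['Text'] also changes df. A minimal sketch of the fix is an explicit copy:

clean = df.copy()                          # independent copy of the data
clean['Text'] = clean['Text'].apply(dictwords)
# df['Text'] still holds the original word lists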


There are 2 answers

Panda Kim

Example

Make a sample dataframe with 60k unique words, 30k target words, and 18k rows:

import pandas as pd
import numpy as np

# 60k unique words ('0' ~ '59999')
words = list(map(str, range(0, 60000)))

# 30k target words ('0' ~ '29999')
valid_words = words[:30000]

# random dataframe 18k rows
np.random.seed(0)
val = np.random.choice(words, (18000, 4)).tolist()
df = pd.DataFrame({'Text': val})

df

        Text
0       [2732, 43567, 42613, 52416]  <-- strings, not numbers (think of them as words)
1       [45891, 21243, 30403, 32103]
2       [41993, 57043, 20757, 55026]
... ...
17997   [6688, 22472, 36124, 56143]
17998   [55253, 29436, 4113, 22639]
17999   [1128, 12103, 39056, 28174]
18000 rows × 1 columns

Code

Although this is not a fully vectorized operation, the step of expanding the lists into a DataFrame and masking values that are not in valid_words with NaN can be vectorized. Aggregating the result back into lists is still faster than your apply-based approach.

out = (pd.DataFrame(df['Text'].tolist())                  # expand lists into a wide DataFrame
       [lambda x: x.isin(valid_words)]                    # mask words not in valid_words with NaN
       .apply(lambda x: list(x.dropna()), axis=1)         # collect the remaining words back into lists
)

out

0                      [2732]
1                     [21243]
2                     [20757]
                 ...         
17997           [6688, 22472]
17998    [29436, 4113, 22639]
17999    [1128, 12103, 28174]
Length: 18000, dtype: object

The execution time is as follows:

970 ms ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
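If you want the cleaned lists back in the original frame (a usage note, not part of the timing above), out is aligned on the same index as df, so it can be assigned directly:

df['Text'] = out    # aligns on the shared RangeIndex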
Triky

Since the column Text contains lists, you can use pandas explode and then the isin function to keep only the rows whose word is in your list.

You can stop after the first two lines if each row contains at most one word from your list, but if you want the words back in a list, or if any row contains multiple words from your list, use the third line as well.

clean = df.explode('Text')                                                # one row per word
clean = clean[clean['Text'].isin(valid_words)]                            # keep only dictionary words
clean = clean.groupby(clean.index).agg({'Text': list, 'Class': 'first'})  # rebuild one list per original row

End result:

Text            Class
[apple, carrot] fruit
[carrot]        vegetable
[steak]         meat

I added the 'carrot' in the first row of my test df to cover the case where a row contains multiple words from your list.
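For reference, the modified test df would then look roughly like this (my reconstruction; only the extra 'carrot' in the first row is stated above):

import pandas as pd

df = pd.DataFrame({'Text': [['this', 'is', 'an', 'apple', 'carrot'],   # 'carrot' added for the test
                            ['this', 'is', 'a', 'carrot'],
                            ['this', 'is', 'a', 'steak']],
                   'Class': ['fruit', 'vegetable', 'meat']})

valid_words = ['apple', 'carrot', 'steak']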