I am trying to remove non-dictionary words from a medium-sized (18k rows) pandas dataframe, but my approach is extremely slow. Basically, I filter each row with a list comprehension and apply that to the entire dataframe with .apply(). This works, but it is very slow, and I have not managed to vectorize the process. How can I speed this up?
My approach also seems to affect multiple dataframes at once, which I do not want to happen. How can I fix that?
The code below represents the approach I have taken, just with significantly less data:
import pandas as pd

df = pd.DataFrame({'Text': [['this', 'is', 'an', 'apple'],
                            ['this', 'is', 'a', 'carrot'],
                            ['this', 'is', 'a', 'steak']],
                   'Class': ['fruit', 'vegetable', 'meat']})

valid_words = ['apple', 'carrot', 'steak']

def dictwords(text):
    valid_text = [word for word in text if word in valid_words]
    return valid_text

clean = df
clean['Text'] = clean['Text'].apply(dictwords)
This works, but it is far too slow for my actual data. The real dataset has about 60k unique words - both valid and invalid - and I am trying to keep only about 30k of them, spread over roughly 18k rows of text. As one would expect, pushing this through .apply() takes an extremely long time.
I have tried Numba's njit/jit for parallelization, but without much luck. What vectorization or parallelization techniques can I try for this data, and is there a better approach than a list comprehension?
Also, I found that when I applied dictwords() to clean, it also seemed to be applied to df. I'm not sure why this happens or how to prevent it, so any explanation would be helpful as well. It happens in every Jupyter Notebook environment I've tested.
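For reference, this minimal check (run right after the code above) shows what I mean - I only assigned to clean, yet df has changed too:

print(df['Text'].tolist())
# [['apple'], ['carrot'], ['steak']]  <- df shows the filtered lists as well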
Example
Make a sample dataframe df with 60k unique words, 30k of which are target words, and 18k rows:
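One way to build such a dataframe is sketched below (the synthetic word pool, the 10-words-per-row length, and the random seed are arbitrary choices):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 60k unique "words"; treat the first 30k as the valid dictionary
all_words = np.array([f'word{i}' for i in range(60_000)])
valid_words = all_words[:30_000].tolist()

# 18k rows, each a list of 10 words drawn from the full pool
df = pd.DataFrame({
    'Text': [rng.choice(all_words, size=10).tolist() for _ in range(18_000)],
    'Class': 'sample',
})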
Code
Although it is not a completely vectorized operation, you can vectorize the step of expanding the lists into a DataFrame and replacing every value that is not in valid_words with NaN. Even after aggregating the result back into per-row lists, this is faster than the .apply() approach.
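A sketch of that approach follows; the variable names (expanded, masked, out) and the handling of rows left with no valid words are my own choices:

# One column per token position; shorter lists are padded with None
expanded = pd.DataFrame(df['Text'].tolist(), index=df.index)

# Vectorized membership test: anything not in valid_words becomes NaN
masked = expanded.where(expanded.isin(valid_words))

# Copy so the original df is untouched (clean = df only binds a second
# name to the same object, which is why df changed in the question)
out = df.copy()

# Collapse back to one list per row; dropna() removes the invalid words
out['Text'] = masked.stack().dropna().groupby(level=0).agg(list)

# Rows whose words were all invalid are NaN after the alignment above;
# turn them into empty lists
out['Text'] = [x if isinstance(x, list) else [] for x in out['Text']]

The membership test now happens in a single isin() call over the whole frame instead of one Python-level list scan per word, which is where the speedup comes from.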
You can compare the execution time of the two approaches on the sample data, for example:
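A minimal timing sketch (the helper names are mine; the actual numbers depend on your data and machine):

import time

def apply_version(frame):
    # the original approach from the question; with a 30k-element
    # valid_words list this is slow by design
    return frame['Text'].apply(dictwords)

def isin_version(frame):
    expanded = pd.DataFrame(frame['Text'].tolist(), index=frame.index)
    masked = expanded.where(expanded.isin(valid_words))
    return masked.stack().dropna().groupby(level=0).agg(list)

for name, fn in [('apply', apply_version), ('isin', isin_version)]:
    start = time.perf_counter()
    fn(df)
    print(f'{name}: {time.perf_counter() - start:.3f} s')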