How to speed up pandas boolean indexing with multiple string conditions

538 views Asked by At

I have a 73 million row dataset, and I need to filter out rows that match any of a few conditions. I have been doing this with Boolean indexing, but it's taking a really long time (~30mins) and I want to know if I can make it faster (e.g. fancy indexing, np.where, np.compress?)

My code:

clean_df = df[~(df.project_name.isin(p_to_drop) | 
                df.workspace_name.isin(ws_to_drop) | 
                df.campaign_name.str.contains(regex_string,regex=True) | 
                df.campaign_name.isin(small_launches))]

The regex string is

regex_string = '(?i)^.*ARCHIVE.*$|^.*birthday.*$|^.*bundle.*$|^.*Competition followups.*$|^.*consent.*$|^.*DOI.*$|\
                    ^.*experiment.*$|^.*hello.*$|^.*new subscribers.*$|^.*not purchased.*$|^.*parent.*$|\
                    ^.*re engagement.*$|^.*reengagement.*$|^.*re-engagement.*$|^.*resend.*$|^.*Resend of.*$|\
                    ^.*reward.*$|^.*survey.*$|^.*test.*$|^.*thank.*$|^.*welcome.*$'

and the other three conditions are lists of strings with fewer than 50 items.

2

There are 2 answers

0
Rutger On BEST ANSWER

If you have this many rows, I think it will be faster to first remove the records one step at a time. Regex is typically slow, so you could use that as a last step with a much smaller data frame.

For example:

clean_df = df.copy()
clean_df = clean_df.loc[~(df.project_name.isin(p_to_drop)]
clean_df = clean_df.loc[~df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df.loc[~df.campaign_name.isin(small_launches)]
clean_df = clean_df.loc[~df.campaign_name.str.contains(regex_string,regex=True)]
2
travelsandbooks On

I had thought that chaining my conditions was a good idea but the answer about making them consecutive helped me rethink: each time I ran a Boolean indexing operation, I was making the dataset smaller - therefore cheaper for the next operation.

I've separated them out, as suggested, and put the operation that gets rid of the most rows at the top, so the next ones are quicker. I've put the regex last - because it's expensive, it makes sense to do it on the smallest df possible.

Hope this helps someone! TIL chaining your operations looks good but it's not efficient :)