How to speed up pandas boolean indexing with multiple string conditions

Question

How to speed up pandas boolean indexing with multiple string conditions

530 views Asked by travelsandbooks At 18 May 2021 at 11:07

I have a 73 million row dataset, and I need to filter out rows that match any of a few conditions. I have been doing this with Boolean indexing, but it's taking a really long time (~30mins) and I want to know if I can make it faster (e.g. fancy indexing, np.where, np.compress?)

My code:

clean_df = df[~(df.project_name.isin(p_to_drop) | 
                df.workspace_name.isin(ws_to_drop) | 
                df.campaign_name.str.contains(regex_string,regex=True) | 
                df.campaign_name.isin(small_launches))]

The regex string is

regex_string = '(?i)^.*ARCHIVE.*$|^.*birthday.*$|^.*bundle.*$|^.*Competition followups.*$|^.*consent.*$|^.*DOI.*$|\
                    ^.*experiment.*$|^.*hello.*$|^.*new subscribers.*$|^.*not purchased.*$|^.*parent.*$|\
                    ^.*re engagement.*$|^.*reengagement.*$|^.*re-engagement.*$|^.*resend.*$|^.*Resend of.*$|\
                    ^.*reward.*$|^.*survey.*$|^.*test.*$|^.*thank.*$|^.*welcome.*$'

and the other three conditions are lists of strings with fewer than 50 items.

Original Q&A

There are 2 answers

travelsandbooks On 18 May 2021 at 15:52

I had thought that chaining my conditions was a good idea but the answer about making them consecutive helped me rethink: each time I ran a Boolean indexing operation, I was making the dataset smaller - therefore cheaper for the next operation.

I've separated them out, as suggested, and put the operation that gets rid of the most rows at the top, so the next ones are quicker. I've put the regex last - because it's expensive, it makes sense to do it on the smallest df possible.

Hope this helps someone! TIL chaining your operations looks good but it's not efficient :)

**Rutger** · Accepted Answer · 2021-05-18T11:13:09+00:00

If you have this many rows, I think it will be faster to first remove the records one step at a time. Regex is typically slow, so you could use that as a last step with a much smaller data frame.

For example:

clean_df = df.copy()
clean_df = clean_df.loc[~(df.project_name.isin(p_to_drop)]
clean_df = clean_df.loc[~df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df.loc[~df.campaign_name.isin(small_launches)]
clean_df = clean_df.loc[~df.campaign_name.str.contains(regex_string,regex=True)]

TechQA.

How to speed up pandas boolean indexing with multiple string conditions

There are 2 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in STRING-OPERATIONS

Related Questions in BOOLEAN-INDEXING

Popular Questions

Popular Tags

Trending Questions