Is there a better way to check whether each element of a dataframe is contained in a given string?

Let's say we have a dataframe df representing the activities of some people as follows:

index Mary Tristan Louise Arnaud Justin Stacy
0 Engineer Software Engineer Rock Singer Rap Singer Lumberjack Biomedical Engineer
1 Guitarist Aerospace Engineer Author Firefighter
2 Business Man

And I would like to check if each activity is or might be software engineering. With s = 'Software Engineer', we would obtain:

index Mary Tristan Louise Arnaud Justin Stacy
0 True True False False False False
1 False False False False False False
2 False False False False False False

This means that I want to test, for every cell in df, whether it is a substring of s. The following already works, but it looks dirty:

s = 'Software Engineer'
df.apply(lambda col: col.apply(lambda x: str(x) in s))

What bothers me is the double apply. Is there a better solution?

There are 2 answers

abdelgha4 On BEST ANSWER

To check whether each cell in your dataframe is a substring of s, there is no need for numpy; you can use applymap:

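df.applymap(lambda cell: isinstance(cell, str) and bool(cell) and cell in s)

Note: isinstance(cell, str) excludes NaN cells (NaN is a float, so bool(NaN) is True and cell in s would raise a TypeError), and bool(cell) excludes empty strings; both are marked as False. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.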
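As a minimal sketch of the element-wise check on a small frame with missing values: the sample data, the column placement of the sparse rows, and the helper name is_substring_of_s are assumptions made for illustration, not taken from the question. The explicit isinstance guard is there because NaN is a float and cell in s requires a string on the left.

```python
import numpy as np
import pandas as pd

s = "Software Engineer"

# Hypothetical sample data; where the sparse rows sit is an assumption.
df = pd.DataFrame(
    {
        "Mary": ["Engineer", "Guitarist", np.nan],
        "Tristan": ["Software Engineer", np.nan, np.nan],
        "Louise": ["Rock Singer", "", "Business Man"],
    }
)


def is_substring_of_s(cell):
    # Only non-empty strings can match; NaN (a float) and "" map to False.
    return isinstance(cell, str) and bool(cell) and cell in s


# DataFrame.map replaces applymap in pandas >= 2.1; fall back on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
result = elementwise(is_substring_of_s)
print(result)
```

Only the 'Engineer' and 'Software Engineer' cells come out True here; the NaN and empty cells come out False instead of raising.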

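If you want the other way around, i.e. to check whether s is a substring of each cell, you can use the vectorized string methods to further optimize your code:

df.apply(lambda column: column.str.contains(s, regex=False, na=False))

Here regex=False treats s as a literal string rather than a regular expression, and na=False marks missing cells as False instead of NaN.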
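The reverse direction can be sketched on a small frame as follows. The sample data (including the 'Senior Software Engineer' cell) is hypothetical, and the regex=False and na=False arguments are a choice made here so that s is matched literally and missing cells yield False.

```python
import numpy as np
import pandas as pd

s = "Software Engineer"

# Hypothetical sample data, chosen so that one cell strictly contains s.
df = pd.DataFrame(
    {
        "Tristan": ["Software Engineer", "Aerospace Engineer"],
        "Stacy": ["Senior Software Engineer", np.nan],
    }
)

# Column-wise vectorized check: is s a substring of each cell?
contains_s = df.apply(
    lambda column: column.str.contains(s, regex=False, na=False)
)
print(contains_s)
```

Note that the .str accessor only works on string columns, so a column holding nothing but NaN (stored as floats) would need to be converted or skipped first.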
Debi Prasad On

One option is to work on the underlying numpy array and compare it against the target string directly:

# Let's assume df is your dataframe which contains all the information
# Replace the null values with the string 'None'
df = df.fillna('None')
values = df.values
boolean_values = values == 'Software Engineer'

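Now boolean_values is a boolean array with the same shape as df. Note that == tests exact equality, so a cell such as 'Engineer' comes out False here, whereas the expected output in the question (a substring test) marks it True. You can then rebuild a dataframe from the array:

import pandas as pd
cols = df.columns
df = pd.DataFrame(boolean_values, columns=cols, index=df.index)

And there you go, you have a boolean dataframe with the original columns and index.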
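To make the difference between comparing for equality and testing for a substring concrete, here is a small sketch that builds both masks from the same array. The sample data and the names equal_mask and substring_mask are assumptions for illustration; to_numpy(dtype=object) is used here in place of .values to get a plain object array.

```python
import pandas as pd

s = "Software Engineer"

# Hypothetical sample data with one missing cell.
df = pd.DataFrame(
    {
        "Mary": ["Engineer", None],
        "Tristan": ["Software Engineer", "Aerospace Engineer"],
    }
)

# Replace missing cells with a placeholder string so every cell is a str.
filled = df.fillna("None")
values = filled.to_numpy(dtype=object)

# Exact equality: True only where the cell is identical to s.
equal_mask = pd.DataFrame(values == s, columns=df.columns, index=df.index)

# Substring test: True where the cell occurs inside s.
substring_mask = pd.DataFrame(
    [[cell in s for cell in row] for row in values],
    columns=df.columns,
    index=df.index,
)

print(equal_mask)
print(substring_mask)
```

The two masks differ only in the 'Engineer' cell, which is a substring of s but not equal to it.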