Initializing Pandas DF Columns if any Substrings in Another Column


My dataframe has a summary column containing plain text. I also have a dictionary that maps new column names (keys) to lists of keywords (values). I'd like to add all of those columns to my dataframe, with each row set to 1 if any of the column's keywords appears in that row's summary, or -99 if none of them do.

Here's my code trying to accomplish this:

# headers is a list of strings, keywords is a list of lists.  Each column has a list of keywords
KEYWORDS_DICT = dict(zip(headers, keywords))

for column in KEYWORDS_DICT:
    df[column] = np.where(any(df['summary'].str.contains(keyword) for keyword in KEYWORDS_DICT[column]), 1, -99)
        

It's currently giving me 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().' Is there a good way to resolve this or another way to accomplish my goal?

Thanks!


There are 2 answers

Answer by zishaf (best answer)

The proposed answer gave me all 1s for all columns. I was able to get my desired result by calling '|'.join() on my keyword lists then searching my summary for that string.
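A minimal sketch of that approach, using small made-up data for illustration. The re.escape call is an addition of mine, on the assumption that the keywords are meant as literal text rather than regular expressions; drop it if your keywords are deliberately regex patterns. The na=False argument assumes missing summaries should count as "no keyword present".

```python
import re

import numpy as np
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({'summary': ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {'col1': ["abc", "xyz"], 'col2': ["nm"]}

for column, words in KEYWORDS_DICT.items():
    # Build one alternation pattern such as "abc|xyz"; escaping keeps
    # characters like "." or "+" from being read as regex syntax
    pattern = '|'.join(re.escape(word) for word in words)
    # str.contains returns one boolean per row, so np.where works row by row
    matches = df['summary'].str.contains(pattern, regex=True, na=False)
    df[column] = np.where(matches, 1, -99)

print(df)
```

One caveat: an empty keyword list produces an empty pattern, which matches every row, so such columns may need special handling.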

Answer by Suraj Shourie

You have to add .any() after your str.contains call; see the code below:

import numpy as np
import pandas as pd

# temp data
df = pd.DataFrame({'summary': ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {'col1': ["abc", "xyz"], "col2": ["nm"]}

# note the added .any()
for column in KEYWORDS_DICT:
    df[column] = np.where(any(df['summary'].str.contains(keyword).any() for keyword in KEYWORDS_DICT[column]), 1, -99)

Output:

{'summary': {0: 'abc', 1: 'qwe', 2: 'xyz'},
 'col1': {0: 1, 1: 1, 2: 1},
 'col2': {0: -99, 1: -99, 2: -99}}
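
Note that .any() reduces each keyword's mask to a single boolean, so every row in a column receives the same value, as the output shows. If a per-row flag is wanted instead, one possible sketch keeps the masks as Series and combines them elementwise. The use of np.logical_or.reduce and regex=False (literal matching) here is my own choice, not something from the question:

```python
import numpy as np
import pandas as pd

# Same temp data as above
df = pd.DataFrame({'summary': ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {'col1': ["abc", "xyz"], "col2": ["nm"]}

for column, words in KEYWORDS_DICT.items():
    # One boolean Series per keyword; regex=False matches the keyword literally
    masks = [df['summary'].str.contains(word, regex=False) for word in words]
    # Elementwise OR across the masks gives one boolean per row
    row_hit = np.logical_or.reduce(masks)
    df[column] = np.where(row_hit, 1, -99)

print(df.to_dict())
```

This avoids the original ValueError because Python's built-in any() is never asked for the truth value of a whole Series.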