I have two dataframes: df, with one column containing text data, and econ_terms, with two columns containing positive and negative economic terms.
I want to remove all rows whose text does not contain any of the strings from the 'positive' or 'negative' economic terms.
# Convert the column to a string
df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)
# Get the unique values from 'plus' and 'minus' columns in the 'econ_terms' DataFrame
econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))
# Filter the 'df' DataFrame using boolean indexing
df_filtered = df[df['text'].isin(econ_values)]
The column 'minus' contains words such as unemployment, which clearly appears in the 'text' column when I go through the data manually.
However, df_filtered is an empty dataframe. What could be the reason for this?
You are describing one thing, but your code is doing something entirely different.
You want to check whether 'text' contains any of the terms from 'plus' & 'minus', but you're checking whether the whole 'text' value is one of the values in 'plus' & 'minus'. This example illustrates the problem:
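A minimal sketch of the problem, assuming some made-up sample data (the values below are illustrative and not taken from the question):

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({'text': ['good_value', 'equally good_value', 'bad_value', 'neutral']})
econ_terms = pd.DataFrame({'plus': ['good_value'], 'minus': ['bad_value']})

econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

# isin() keeps a row only when the ENTIRE cell equals one of the terms;
# it does not look for the terms inside the cell
df_isin = df[df['text'].isin(econ_values)]
print(df_isin['text'].tolist())  # ['good_value', 'bad_value']
```

With real sentences in 'text', almost no row is exactly equal to a single term, which is why the filtered dataframe ends up empty.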
Notice how 'equally good_value' disappeared because it was not in 'plus' & 'minus', even though it contains 'good_value'.
You need to search the other way around:
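One way to do that, sketched here with made-up sample data, is to join the terms into a single regular expression and use str.contains. The names terms, pattern and mask are illustrative, and this is only one possible approach:

```python
import re
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({'text': ['good_value', 'equally good_value', 'bad_value', 'neutral']})
econ_terms = pd.DataFrame({'plus': ['good_value'], 'minus': ['bad_value']})

# Collect the terms from both columns; drop NaN BEFORE converting to str,
# otherwise missing values become the literal string 'nan'
terms = pd.concat([econ_terms['plus'], econ_terms['minus']]).dropna().astype(str).unique()

# Escape each term so special regex characters are matched literally
pattern = '|'.join(re.escape(t) for t in terms)

# True where the text contains at least one term as a substring
mask = df['text'].str.contains(pattern, regex=True)
print(mask.tolist())  # [True, True, True, False]
```

If the two term columns have different lengths, calling astype(str) without dropna() would add the term 'nan', which matches any text containing those letters (for example "finance"). Passing case=False to str.contains makes the match case-insensitive if that is what you need.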
Then you can filter rows:
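A self-contained sketch of the filtering step, again assuming made-up sample data and an illustrative pattern built from the term columns:

```python
import re
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({'text': ['good_value', 'equally good_value', 'bad_value', 'neutral']})
econ_terms = pd.DataFrame({'plus': ['good_value'], 'minus': ['bad_value']})

# Build one regex that matches any of the terms
terms = pd.concat([econ_terms['plus'], econ_terms['minus']]).dropna().astype(str).unique()
pattern = '|'.join(re.escape(t) for t in terms)

# Keep only the rows whose text contains at least one term
df_filtered = df[df['text'].str.contains(pattern, regex=True)]
print(df_filtered['text'].tolist())  # ['good_value', 'equally good_value', 'bad_value']
```

Here 'equally good_value' is kept because it contains 'good_value', while 'neutral' is dropped because it contains none of the terms.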
The example above reverses the search: it checks whether the terms from 'plus' & 'minus' appear inside 'text', not the other way around.