string matches are empty although string is contained in column

38 views Asked by At

I have two dataframes: one dataframe df with one column containig text data and another dataframe econ_terms with two columns containing positive and negative economic terms.

I want to remove all text rows that do not contain any strings from 'positive' or 'negative' economic terms

# Convert the column to a string
df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)

# Get the unique values from 'plus' and 'minus' columns in the 'econ_terms' DataFrame
econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

# Filter the 'df' DataFrame using boolean indexing
df_filtered = df[df['text'].isin(econ_values)]

the column 'minus' contains words such as unemployment, which is clearly in the 'text' column when going through the data manually.

However the df_filtered shows an empty dataframe. What could be the reason for this ?

2

There are 2 answers

0
Geom On BEST ANSWER

You are describing one thing but your code is doing an entirely different thing.

You want to search if 'text' contains the terms for 'plus' & 'minus' but your checking if the 'text' is contained within 'plus' & 'minus'. This example illustrates the problem:

import pandas as pd

d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)

d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)

df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)

econ_values = set(econ_values['plus']).union(set(econ_values['minus']))

df_filtered = df[df['text'].isin(econ_values)]

print(df_filtered)

Output:

         text
2  good_value

Notice how 'equally good_value' disappeared because it was not in 'plus' & 'minus' even though it contains 'good_value'.

You need to search the other way around:

import pandas as pd

d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)

d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)

df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)

econ_values = set(econ_values['plus']).union(set(econ_values['minus']))

def search_text(row):
    match_count = 0
    for term in econ_values:
        if term in row:
            match_count +=1
    if match_count > 0:
        return True
    else:
        return False

df['contains'] = df['text'].apply(lambda row: search_text(row))

Output:

                 text     contains
0  equally good_value      True
1           bad_value     False
2          good_value      True

Then you can filter rows:

df_filtered = df.loc[df['contains']==True]
print(df_filtered)

Output:

                 text  contains
0  equally good_value  True
2          good_value  True

The example above reverses the search. It checks if 'plus' & 'minus' are inside 'text' not the other way around.

0
Muthu Vijay On

Your code is correct! it works perfectly for me. This is the sample code that I tried.

import pandas as pd

data1 = {
    'text': [9, 10, 3, 4],
}

data2 = {
    'plus': [1, 2, 5, 6],
    'minus': [3, 4, 7, 8]
}

df = pd.DataFrame(data1)
econ_terms = pd.DataFrame(data2)

df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)

print(type(df['text'][0]))

econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

print(econ_values)

df_filtered = df[df['text'].isin(econ_values)]

print(df_filtered)

Output:

   a  b
2  3  7
3  4  8