string matches are empty although string is contained in column

Question

string matches are empty although string is contained in column

38 views Asked by datascientistxxx At 31 May 2023 at 11:07

I have two dataframes: one dataframe df with one column containig text data and another dataframe econ_terms with two columns containing positive and negative economic terms.

I want to remove all text rows that do not contain any strings from 'positive' or 'negative' economic terms

# Convert the column to a string
df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)

# Get the unique values from 'plus' and 'minus' columns in the 'econ_terms' DataFrame
econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

# Filter the 'df' DataFrame using boolean indexing
df_filtered = df[df['text'].isin(econ_values)]

the column 'minus' contains words such as unemployment, which is clearly in the 'text' column when going through the data manually.

However the df_filtered shows an empty dataframe. What could be the reason for this ?

Original Q&A

There are 2 answers

Muthu Vijay On 31 May 2023 at 11:35

Your code is correct! it works perfectly for me. This is the sample code that I tried.

import pandas as pd

data1 = {
    'text': [9, 10, 3, 4],
}

data2 = {
    'plus': [1, 2, 5, 6],
    'minus': [3, 4, 7, 8]
}

df = pd.DataFrame(data1)
econ_terms = pd.DataFrame(data2)

df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)

print(type(df['text'][0]))

econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

print(econ_values)

df_filtered = df[df['text'].isin(econ_values)]

print(df_filtered)

Output:

   a  b
2  3  7
3  4  8

**Geom** · Accepted Answer · 2023-05-31T11:31:41+00:00

You are describing one thing but your code is doing an entirely different thing.

You want to search if 'text' contains the terms for 'plus' & 'minus' but your checking if the 'text' is contained within 'plus' & 'minus'. This example illustrates the problem:

import pandas as pd

d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)

d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)

df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)

econ_values = set(econ_values['plus']).union(set(econ_values['minus']))

df_filtered = df[df['text'].isin(econ_values)]

print(df_filtered)

Output:

         text
2  good_value

Notice how 'equally good_value' disappeared because it was not in 'plus' & 'minus' even though it contains 'good_value'.

You need to search the other way around:

import pandas as pd

d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)

d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)

df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)

econ_values = set(econ_values['plus']).union(set(econ_values['minus']))

def search_text(row):
    match_count = 0
    for term in econ_values:
        if term in row:
            match_count +=1
    if match_count > 0:
        return True
    else:
        return False

df['contains'] = df['text'].apply(lambda row: search_text(row))

Output:

                 text     contains
0  equally good_value      True
1           bad_value     False
2          good_value      True

Then you can filter rows:

df_filtered = df.loc[df['contains']==True]
print(df_filtered)

Output:

                 text  contains
0  equally good_value  True
2          good_value  True

The example above reverses the search. It checks if 'plus' & 'minus' are inside 'text' not the other way around.

TechQA.

string matches are empty although string is contained in column

There are 2 answers

Related Questions in PYTHON

Related Questions in NLP

Related Questions in ISIN

Popular Questions

Trending Questions