I'm working with Twitter data in a dataframe. I want to filter the column that holds the text of each tweet according to a certain keyword found within the text.

I've tried str.contains, but it doesn't seem to work, I assume because the column is a Series. I want to keep only the rows whose "text" column contains the keyword 'remoaners'.

remoaners_only = time_plus_text[time_plus_text["text"].str.contains("remoaners", case=False, na=False)]

This produces either an empty dataframe or a lot of NaNs.

pandas version is 0.24.1.

Here's the input data: time_plus_text["text"].head(10)


0    [ #bbcqt Remoaners on about post Brexit racial...
1    [@sarahwollaston Shut up, you like all remoane...
2    [ what have the Brextremists ever done for us ...
3                     [ Remoaner in bizarre outburst ]
4    [ Anyone who disagrees with brexit is called n...
5    [ @SkyNewsBreak They forecasted if the vote wa...
6    [ but we ARE LEAVING THE #EU, even the #TORIES...
7    [ Can unelected Remoaner peers not see how abs...
8    [@sizjam68 @LeaveEUOfficial @johnredwood It wo...
9    [ Hey @BBC have you explained why when award w...
Name: text, dtype: object

2 Answers

Ben.T

The problem is that each cell holds a list containing the string, not the string itself, so str.contains has no string to search. Pull the string out of the list with str[0] first, then call str.contains on the result:

import pandas as pd

# input: each cell is a one-element list holding the tweet text
time_plus_text = pd.DataFrame({'text':[['#bbcqt Remoaners on about post Brexit racial...'],
                                       ['@sarahwollaston Shut up, you like all remoaners...'],
                                       ['what have the Brextremists ever done for us ...']]})
print(time_plus_text["text"].str[0].str.contains("remoaners", case=False, na=False))

# output
0     True
1     True
2    False
Name: text, dtype: bool

So you should do:

# boolean mask built from the first element of each list
mask = time_plus_text["text"].str[0].str.contains("remoaners", case=False, na=False)
remoaners_only = time_plus_text[mask]
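
If a list could ever hold more than one string, str[0] only searches the first one. A possible variation is to join each list into a single string with str.join before matching. This is only a sketch: the sample rows below are made up to mirror the question, and it assumes every list element is a string.

```python
import pandas as pd

# Hypothetical sample data: each cell is a list of strings, as in the question.
time_plus_text = pd.DataFrame({
    "text": [
        ["#bbcqt Remoaners on about post Brexit racial..."],
        ["@sarahwollaston Shut up, you like all remoaners..."],
        ["what have the Brextremists ever done for us ..."],
    ]
})

# Join the elements of each list into one string, then match on that string.
joined = time_plus_text["text"].str.join(" ")
mask = joined.str.contains("remoaners", case=False, na=False)

# Keep only the rows where the keyword was found.
remoaners_only = time_plus_text[mask]
print(remoaners_only)
```

The filtered frame still holds the original list cells, because the mask is only used to select rows.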

Rich Andrews

Your code works for me. That leaves two things to check: your input data, and the pandas bug-fix release you are on (0.24.1 vs 0.24.2).

Output on 0.24.2:

0.24.2
   index                                               text
0      0     [ #bbcqt Remoaners on about post Brexit rac...

The script that produced it:

import pandas as pd
import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

print(pd.__version__)

csvdata = StringIO("""0,   [ #bbcqt Remoaners on about post Brexit racial...
1,   [@sarahwollaston Shut up, you like all remoane...
2,   [ what have the Brextremists ever done for us ...
3,                    [ Remoaner in bizarre outburst ]
4,   [ Anyone who disagrees with brexit is called n...
5,   [ @SkyNewsBreak They forecasted if the vote wa...
6,   [ but we ARE LEAVING THE #EU, even the #TORIES...
7,   [ Can unelected Remoaner peers not see how abs...
8,   [@sizjam68 @LeaveEUOfficial @johnredwood It wo...
9,   [ Hey @BBC have you explained why when award w...""")

df = pd.read_csv(csvdata, names=["index", "text"], sep=",")

result = df[df["text"].str.contains("remoaners", case=False, na=False)]

# results
print(result)
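
One way to check the input data, as suggested above, is to look at the Python type stored in each cell. The sketch below uses a made-up two-column frame (the column names as_list and as_str are hypothetical) to compare list cells with plain-string cells; it is meant as a diagnostic, not a fix.

```python
import pandas as pd

# Hypothetical data: one column with list cells, one with plain-string cells.
df = pd.DataFrame({
    "as_list": [["Remoaners on about ..."], ["no keyword here"]],
    "as_str": ["Remoaners on about ...", "no keyword here"],
})

# Count the Python type held in the cells of each column.
list_types = df["as_list"].map(type).value_counts()
str_types = df["as_str"].map(type).value_counts()
print(list_types)
print(str_types)

# str.contains cannot search a list cell; with na=False those cells become False.
list_mask = df["as_list"].str.contains("remoaners", case=False, na=False)
str_mask = df["as_str"].str.contains("remoaners", case=False, na=False)
print(list_mask)
print(str_mask)
```

If the type count for your own "text" column reports list rather than str, the mask should come out all False, which would explain an empty filtered dataframe.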