I'm new to Python and I want to remove rows from a pandas DataFrame based on substrings in one of its columns. How can I do that?

So far, I have managed to locate where the substring is in each row, but I'm not able to get the substring itself so that I can remove the rows it references.

An example goes like this:

import numpy as np
import pandas as pd

a = [['a', 1, 'abc 15 hij on 11/11/18'], ['b', 2, np.nan], ['c', 3, 'efg abc 25'], ['a', 15, np.nan], ['c', 25, np.nan], ['a', 10, np.nan]]
df = pd.DataFrame(a)
df.columns = ['Id', 'Action', 'description']

That gives me the df:

  Id  Action             description
0  a       1  abc 15 hij on 11/11/18
1  b       2                     NaN
2  c       3              efg abc 25
3  a      15                     NaN
4  c      25                     NaN
5  a      10                     NaN

In this case, I'd like to remove rows 3 and 4 because the numbers in their 'Action' column (15 and 25) are referenced in the 'description' column after the pattern 'abc'. What I have so far is:

b = df.description
c = b.str.find('abc')
d = c+4
e = b.str.get(d)

But when I call .str.get, it raises the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1 Answer

hkyi On Best Solutions

pandas.Series.str.extract may help you.

excludes = set(df.description.str.extract(r'abc (\d+)')[0].values) - set([np.nan])
df[~df['Action'].isin(excludes)]

which yields:

  Id  Action             description
0  a       1  abc 15 hij on 11/11/18
1  b       2                     NaN
2  c       3              efg abc 25
5  a      10                     NaN
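
One caveat about the snippet above: str.extract returns the captured digits as strings ('15', '25'), while 'Action' holds integers, so depending on the pandas version the isin comparison may not match anything. A minimal self-contained sketch that converts the captures to integers first is shown below. It rebuilds the question's DataFrame, and the variable names (rows, referenced, result) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
rows = [
    ['a', 1, 'abc 15 hij on 11/11/18'],
    ['b', 2, np.nan],
    ['c', 3, 'efg abc 25'],
    ['a', 15, np.nan],
    ['c', 25, np.nan],
    ['a', 10, np.nan],
]
df = pd.DataFrame(rows, columns=['Id', 'Action', 'description'])

# Capture the digits that follow 'abc '; rows without a match yield NaN.
# expand=False returns a Series because the pattern has a single group.
referenced = df['description'].str.extract(r'abc (\d+)', expand=False)

# Drop the non-matches and convert the captured strings to integers
# so they are comparable with the integer 'Action' column.
excludes = referenced.dropna().astype(int)

# Keep only the rows whose 'Action' value is not referenced anywhere.
result = df[~df['Action'].isin(excludes)]
print(result)
```

Note that this only picks up the first 'abc <number>' occurrence in each description, which is all the example data needs.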