I have a row that I would like to filter for in a dataframe.

ch=b611067=football

My question is: I would like to filter on just the b611067 section.

I understand I can use str.startswith('b') to find the start of the ID, but what I am looking for is a way to say something like str.contains('random 6-digit numerical value').

Hope this makes sense.

2 Answers

2
Community On

I am not sure (yet) how to do this efficiently in pandas, but you can use regex for the match:

import re

pattern = r'(b\d{6})'  # raw string: 'b' followed by six digits
text = 'ch=b611067=football'
matches = re.findall(pattern=pattern, string=text)
for match in matches:
    pass # do something

Edit: this answer explains how to use regex with pandas: How to filter rows in pandas by regex
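
As a rough sketch of how the same pattern could be applied to a whole column rather than a single string: the example below assumes the values live in a column called foo (a hypothetical name, as is the sample data) and uses Series.str.extract to pull out the ID, which gives NaN for rows where the pattern does not occur.

```python
import pandas as pd

# Hypothetical sample data; the column name "foo" is an assumption.
df = pd.DataFrame({"foo": ["ch=b611067=football", "no id here"]})

# str.extract returns the first match of the capture group for each row,
# or NaN where the pattern does not occur in the string.
ids = df["foo"].str.extract(r"(b\d{6})", expand=False)

# Keep only the rows in which an ID was found.
with_id = df[ids.notna()]
```

Here ids holds the extracted b-prefixed ID per row and with_id is the filtered dataframe. Note that the pattern only requires six digits after the b, so a longer run of digits would also match on its first six.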

1
sgvd On

You can use the .str accessor to use string functions on string columns, including matching by regexp:

import pandas as pd
df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
print(df.foo.str.match(r'.+=b611067=.+'))

Output:

0    False
1     True
2    False
Name: foo, dtype: bool

You can use this to index the dataframe, so for instance:

print(df[df.foo.str.match(r'.+=b611067=.+')])

Output:

                   foo
1  ch=b611067=football

If you want all rows that match the pattern b followed by six digits, you can use the expression provided by tobias_k:

df.foo.str.match(r'.+=b[0-9]{6}=.+')

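Note that df.foo.str.match(r'.+=b611067=.+') gives the same result on this data as df.foo.str.contains(r'=b611067='), which doesn't require you to provide the wildcards and is the solution given in How to filter rows in pandas by regex. The difference is that contains searches for the pattern anywhere in the string, while match anchors it at the start of the string, so, as mentioned in the pandas docs, with match you can be stricter.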
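
To make the difference between the two concrete, here is a small sketch using the generic six-digit pattern. The sample rows are made up for illustration: the second has nothing after the trailing "=", and the third has only five digits.

```python
import pandas as pd

# Hypothetical rows chosen to show where contains and match disagree.
df = pd.DataFrame({"foo": ["us=b611068=handball", "ch=b611067=", "ch=b61106=football"]})

# contains: the pattern may occur anywhere in the string.
contains_mask = df["foo"].str.contains(r"=b\d{6}=")

# match: anchored at the start; the .+ on each side requires at least
# one character before and after the =b<digits>= part.
match_mask = df["foo"].str.match(r".+=b\d{6}=.+")

print(df[contains_mask])
print(df[match_mask])
```

With this data, contains keeps the first two rows, while match keeps only the first, because the second row has no text after the ID.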