How to stop regex matching before a special character


I'm learning to work with regex in Python while cleaning a dataset. Below is a sample.

Player
DG Bradman (AUS)
HC Brook (ENG)

I am trying to use regex to split the player name from the country. I know that str.split would work, but I would like to see whether this can be done with regex.

Country=Player_column.str.extract(r"(\B\(.+)")
Player=Player_column.str.extract(r"([^a-z]\$(.)")
df['Country'] = Country
df['Player'] = Player
df

I was able to work out how to extract the part within the brackets (the country name), but I can't work out how to extract the player name on its own. Could someone help me with this, please?


There are 2 answers

jmcgriz (best answer)

If all of the lines match that format, you can extract the 3 data points with a small regex: [^ )(]+

That returns each run of characters that contains no space or parenthesis, so in this example you'd get ['DG', 'Bradman', 'AUS'] back:

import re

inputstring = "DG Bradman (AUS)"

print(re.findall("[^ )(]+", inputstring))
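To get from those three tokens to the two columns the question asks for, one option is to treat the last token as the country and join everything before it as the player name. This is only a minimal sketch: it assumes every row ends with a parenthesised country code, and split_player is a hypothetical helper name, not something from the question.

```python
import re

def split_player(text):
    """Hypothetical helper: split 'DG Bradman (AUS)' into (player, country)."""
    # Same pattern as above: runs of characters that are not space or parenthesis.
    tokens = re.findall(r"[^ )(]+", text)
    # Assumption: the last token is the country, the rest form the player name.
    # An empty or token-less string would raise ValueError here.
    *name_parts, country = tokens
    return " ".join(name_parts), country

for row in ["DG Bradman (AUS)", "HC Brook (ENG)"]:
    print(split_player(row))
```

Note that this would also split a multi-word country into separate tokens, so it only suits data shaped like the sample.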

Wiktor Stribiżew

You can use

df[['Player', 'Country']] = df['Player'].str.extract(r'^(.*?)\s*\(([^()]*)\)')

Mind the two capturing groups: they are necessary to populate the two column values with Series.str.extract.

Details

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible (the *? lazy quantifier is used to let \s* consume all the whitespace that follows)
  • \s* - zero or more whitespaces
  • \( - a ( char
  • ([^()]*) - Group 2: any zero or more chars other than ( and )
  • \) - a ) char.
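The breakdown above can be checked without pandas by running the same pattern through the standard re module. This is a rough sketch: split_row is a hypothetical helper name, and returning (None, None) for rows that do not match is an assumption made here for illustration.

```python
import re

# Same pattern as in the str.extract call above.
PATTERN = re.compile(r'^(.*?)\s*\(([^()]*)\)')

def split_row(text):
    """Hypothetical helper: return (player, country), or (None, None) if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None, None
    # Group 1 is the lazily matched name; Group 2 is the text inside the parentheses.
    return match.group(1), match.group(2)

for row in ["DG Bradman (AUS)", "HC Brook   (ENG)", "No country here"]:
    print(split_row(row))
```

Because Group 1 is lazy and \s* follows it, the whitespace before the opening parenthesis ends up in neither group, so the player name comes back without trailing spaces.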