Pandas string operations (extract and findall)

Here are 2 examples of string operation methods from the Python Data Science Handbook that I am having trouble understanding.

  1. str.extract()
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte.str.extract('([A-Za-z]+)')

This operation returns the first name of each element in the Series. I don't understand the expression passed to the extract function.

  2. str.findall()
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

This operation returns the original element if it starts and ends with consonants, and returns an empty list otherwise. I figure that the ^ operator stands for negation of vowels and that the * operator combines the upper- and lower-case vowel cases. Yet I do not understand the rest of the operators.

Please help me with understanding these input expressions. Thanks in advance.

There are 2 answers

U13-Forward (best answer)

The first ^ anchors the match to the beginning of the string, whereas $ anchors it to the end of the string. Here is an example:

>>> import re
>>> s = 'a123a'
>>> re.findall('^a', s)
['a']

This returns only one a, because the ^ sign restricts the match to the beginning of the string; the a at the end is not matched.

The same goes for $, which only matches at the end of the string. Here is an example:

>>> import re
>>> s = 'a123a'
>>> re.findall('a$', s)
['a']

Edited:

The r prefix marks a raw string. In a raw string, a backslash \ is not treated as the start of an escape sequence; it stays a regular backslash. That is convenient for regular expressions, which use backslashes heavily.
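To make the raw-string point concrete, here is a small sketch using only the standard re module. The sample text 'cat concat cat' and the variable names are made up for illustration; they do not come from the handbook.

```python
import re

plain = '\n'   # escape sequence: a single newline character
raw = r'\n'    # raw string: a backslash followed by the letter n

print(len(plain), len(raw))
# 1 2

text = 'cat concat cat'  # made-up sample text

# In a raw string, \b reaches the regex engine and means "word boundary".
with_raw = re.findall(r'\bcat\b', text)

# In a normal string, '\b' is turned into a backspace character first,
# so the pattern looks for literal backspaces and finds nothing here.
without_raw = re.findall('\bcat\b', text)

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```

The pattern in the question contains no backslashes, so the r prefix makes no difference there; it is just a safe habit for regular expressions.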

robbo

Your first example:

'([A-Za-z]+)'

refers to a capture group, marked by the '()', that contains a character class (the ranges between the square brackets) matching any upper- or lower-case letter. The + sign after the brackets means you want one or more of them. So it matches a run of letters until it finds a non-letter, which in your case is the space between the first and last names. Because str.extract returns only the first match in each string, the regular expression returns the first name of each row.
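As a rough sketch of what happens per element, the same pattern can be tried with the standard re module. This is not pandas itself: re.search is used here to imitate the "first match only" behaviour of str.extract, and the names list simply repeats the values from the question.

```python
import re

names = ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
         'Eric Idle', 'Terry Jones', 'Michael Palin']

pattern = re.compile(r'([A-Za-z]+)')

# re.search stops at the first match; group(1) is the text
# captured by the parentheses.
first_names = [pattern.search(name).group(1) for name in names]
print(first_names)
# ['Graham', 'John', 'Terry', 'Eric', 'Terry', 'Michael']

# findall keeps scanning past the space, so it also returns the last name.
all_runs = pattern.findall('Graham Chapman')
print(all_runs)
# ['Graham', 'Chapman']
```

The contrast with findall shows that the pattern itself does not say "first name"; it is the first-match behaviour that cuts the result off at the space.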

For your second example:

'^[^AEIOU].*[^aeiou]$'

The first ^ means the start of the string. The second ^, inside the square brackets, negates the character class, as you mentioned: it matches any single character except the ones listed. So [^AEIOU] requires the first character to be anything other than an uppercase vowel; for these capitalized names that means an uppercase consonant, but a digit or a lowercase letter would also qualify. It is then followed by .*, where the '.' means any character (except a line break, so this part is no longer related to your consonants) and the '*' means zero or more of them. So far your regular expression is saying: start with a character that is not an uppercase vowel, followed by any sequence of characters. The final part, '[^aeiou]$', requires the last character to be anything other than a lowercase vowel. The $ sign represents the end of the string.

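So yes, for these names you're effectively returning only the ones that start with an uppercase consonant and end with a lowercase consonant. Because ^ and $ force the pattern to cover the whole string, findall returns either a one-element list holding the full name or an empty list.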
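Here is a minimal sketch that applies the second pattern to each name with the standard re module (not pandas; the list repeats the values from the question). The extra test strings at the end are made up to show that the negated classes accept more than consonants.

```python
import re

names = ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
         'Eric Idle', 'Terry Jones', 'Michael Palin']

pattern = re.compile(r'^[^AEIOU].*[^aeiou]$')

# One result list per name: the whole name if it matches, otherwise [].
results = [pattern.findall(name) for name in names]
for name, result in zip(names, results):
    print(name, '->', result)
# Graham Chapman -> ['Graham Chapman']
# John Cleese -> []
# Terry Gilliam -> ['Terry Gilliam']
# Eric Idle -> []
# Terry Jones -> ['Terry Jones']
# Michael Palin -> ['Michael Palin']

# Made-up inputs: [^AEIOU] and [^aeiou] only exclude vowels of one case,
# so lowercase starts, digits and punctuation are accepted too.
lowercase_start = pattern.findall('alan')
digit_and_punct = pattern.findall('1 apple pie!')
print(lowercase_start)   # ['alan']
print(digit_and_punct)   # ['1 apple pie!']
```

'John Cleese' fails because it ends in a lowercase vowel, and 'Eric Idle' fails already at the first character.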