How do you extract a substring from a column containing people's names and titles in "Myles, Mr. Thomas Francis" format when you only want "Mr."?

I want to add the matched results as a new column of the DataFrame within a Python function.

I tried using re.search():

for i in input_df["Name"]:
   Title[i] = re.search(".$",i)

I get a TypeError and am not sure how to write the pattern to get the desired result.

There are 2 answers

3
Tim Biegeleisen On BEST ANSWER

You could use str.extract here with the regex pattern \b[A-Z][a-z]+\.:

input_df["Title"] = input_df["Name"].str.extract(r'\b([A-Z][a-z]+\.)')

Alternatively, you could use str.replace to strip everything up to and including the comma, along with everything after the title:

input_df["Title"] = input_df["Name"].str.replace(r'^.*,\s+|\s+.*$', '', regex=True)
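Since the question asks for this inside a function, here is a minimal sketch of how the str.extract approach might be wrapped. The function name add_title_column, its parameters, and the sample rows are made up for illustration. expand=False is passed so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

def add_title_column(df, name_col="Name", title_col="Title"):
    """Return a copy of df with the title extracted from name_col."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # Capture a capitalised word followed by a period, e.g. "Mr." or "Miss."
    out[title_col] = out[name_col].str.extract(r"\b([A-Z][a-z]+\.)", expand=False)
    return out

# Hypothetical sample data in the "Surname, Title. Given names" format
input_df = pd.DataFrame({
    "Name": [
        "Myles, Mr. Thomas Francis",
        "Heikkinen, Miss. Laina",
        "No title here",
    ]
})

result = add_title_column(input_df)
print(result)
```

Rows with no match end up as NaN in the new column, so it may be worth checking result["Title"].isna() before relying on the values.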
1
hught On

re.search returns a "match object" if a match is found, or None if no match is found. So you may want to do something like:

for i in input_df["Name"]:
    x = re.search(r"Mrs?\.", i)  # raw string avoids an invalid-escape warning
    if x:
        Title[i] = x.group()  # store the matched text, not the match object

Your code sets the value of Title[i] to a match object instead of a string, which is probably where the type error is coming from. Use the .group() method to return just the matching part of the string. Use the if statement to handle cases where no match was found.
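To make the match-object point concrete, here is a small standalone sketch that does not need pandas. The helper name find_title and the sample names are assumptions for illustration only. It returns the matched text when there is a match and None otherwise.

```python
import re

def find_title(name):
    """Return the first "Mr." or "Mrs." found in name, or None if absent."""
    match = re.search(r"Mrs?\.", name)  # a re.Match object, or None
    if match is None:
        return None
    return match.group()  # the matched text as a plain string

# Hypothetical sample names
names = ["Myles, Mr. Thomas Francis", "Cumings, Mrs. John Bradley", "Unknown"]

# Build a plain dict keyed by name, skipping names without a match
titles = {}
for name in names:
    title = find_title(name)
    if title is not None:
        titles[name] = title

print(titles)
```

With a DataFrame, the same helper could be applied with input_df["Name"].apply(find_title), which would leave None in rows that have no match.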

As for the regex, I don't know exactly what your data looks like, but you could try matching one or more capital letters, followed by zero or more lowercase letters, followed by a period, e.g. re.search(r"[A-Z]+[a-z]*\.", i)

Be warned that regexes can trip over edge cases, so check the results carefully.
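As one illustration of such an edge case, the sketch below compares the loose pattern with a variant anchored after the comma. The sample names are invented to stress the patterns, and the anchored pattern is only one possible tightening, assuming the title always follows the first comma.

```python
import re

# Hypothetical sample names chosen to stress the patterns
names = [
    "Myles, Mr. Thomas Francis",
    "St. John, Mr. Paul",  # abbreviation before the comma
    "Unknown",             # no title at all
]

# Loose: first capitalised token ending in a period, anywhere in the string
LOOSE = re.compile(r"[A-Z]+[a-z]*\.")
# Anchored: letters/spaces ending in a period, directly after the comma
ANCHORED = re.compile(r",\s*([A-Za-z ]+\.)")

def first_match(pattern, text):
    """Return the matched text (or capture group, if any), else None."""
    m = pattern.search(text)
    if m is None:
        return None
    # Use the capture group when the pattern defines one, else the whole match
    return m.group(1) if pattern.groups else m.group()

loose = [first_match(LOOSE, n) for n in names]
anchored = [first_match(ANCHORED, n) for n in names]
print(loose)     # ['Mr.', 'St.', None]
print(anchored)  # ['Mr.', 'Mr.', None]
```

The loose pattern picks up "St." from the surname, while the anchored one only looks after the comma. Neither is guaranteed to fit real data, so comparing the set of extracted titles against what you expect is a reasonable sanity check.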