Parsing data with Pandas - how to output match as a new column

Question

Parsing data with Pandas - how to output match as a new column

133 views Asked by Marcus Webb At 05 March 2024 at 13:54

I've got a routine to read in a CSV file and spit out selected columns that match specific criteria:

CSV Input File looks like this

Name	Role	Login
Phil	Role A \| Role B	2024/01/01
Bob	Role A \| Role B	2024/02/01
Arthur	Role A \| Role C	2024/01/04
Jane	Role B \| Role C	2024/01/31
Mary	Role A \| Role D	2024/02/12
Liz	Role B \| Role F	2024/02/21
Phoebe	Role C \| Role D	2023/11/21
Mike	Role E	2024/02/15
Rick	Role D \| Role E	2024/01/13
Hilary	Role F	2024/01/11

I have a block of code that matches based on a passed value:

# Define function to check if a value matches any of the filter values
def matches_filter(value):
    value_lower = value.lower()
    for filter_value in value_lower.split("|"):
        filter_value_lower = filter_value.lower()
        for fvals in fltr_values:
            if fvals.lower() in filter_value_lower:
                return fvals.lower()
    return None

# Apply filter
# filtered_df = df[df[fltr_field].apply(matches_filter)]
df[fltr_field + "_matched"] = df[fltr_field].apply(matches_filter)

Based on passing the values "Role B" and "Role D" I'd like to replace whatever is in the Role column with the result of the filter. The net resulting table should therefore look something like this:

Name	Role	Login
Phil	Role B	2024/01/01
Bob	Role B	2024/01/01
Jane	Role B	2024/02/03
Mary	Role D	2024/02/02
Liz	Role B	2024/02/12
Phoebe	Role D	2024/02/21
Rick	Role D	2024/01/31

So far, the code will filter so I only get strings containing "Role B" or "Role D" but I'd like to replace the string found with the match criterion, rather than the list of roles. Can someone explain what I need to change here?

To further explain based on comments received so far:

What is the content of fltr_field?

fltr_field contains the name of the column to filter on (in this case, I'm filtering on the column called "Role".

Can you clearly explain what you're trying to do here?

I wish to replace the contents of the Role column with the matched value.

What is the nature of the "Login" column?

The "Login" column contains the last login date.

Original Q&A

There are 1 answers

**OCa** · Answer 1 · 2024-03-06T11:36:02+00:00

With this input dataframe generator:

df = pd.DataFrame({'Name': {0: 'Phil', 1: 'Bob', 2: 'Arthur', 3: 'Jane', 4: 'Mary', 5: 'Liz', 6: 'Phoebe', 7: 'Mike', 8: 'Rick', 9: 'Hilary'}, 
                   'Role': {0: 'Role A | Role B', 1: 'Role A | Role B', 2: 'Role A | Role C', 3: 'Role B | Role C', 4: 'Role A | Role D', 5: 'Role B | Role F', 6: 'Role C | Role D', 7: 'Role E', 8: 'Role D | Role E', 9: 'Role F'}, 
                   'Login': {0: '2024/01/01', 1: '2024/02/01', 2: '2024/01/04', 3: '2024/01/31', 4: '2024/02/12', 5: '2024/02/21', 6: '2023/11/21', 7: '2024/02/15', 8: '2024/01/13', 9: '2024/01/11'}})

     Name             Role       Login
0    Phil  Role A | Role B  2024/01/01
1     Bob  Role A | Role B  2024/02/01
2  Arthur  Role A | Role C  2024/01/04
3    Jane  Role B | Role C  2024/01/31
4    Mary  Role A | Role D  2024/02/12
5     Liz  Role B | Role F  2024/02/21
6  Phoebe  Role C | Role D  2023/11/21
7    Mike           Role E  2024/02/15
8    Rick  Role D | Role E  2024/01/13
9  Hilary           Role F  2024/01/11

A regex pattern extraction looks like your best option:

# Define search - insert any number of Role letters inside the brackets
role_pattern = "(Role\s[BD])"
# Generate filtered table
df.assign(Role = df['Role'].str.extract(pat=role_pattern)
 ).dropna(subset = ['Role'])

     Name    Role       Login
0    Phil  Role B  2024/01/01
1     Bob  Role B  2024/02/01
3    Jane  Role B  2024/01/31
4    Mary  Role D  2024/02/12
5     Liz  Role B  2024/02/21
6  Phoebe  Role D  2023/11/21
8    Rick  Role D  2024/01/13

I saw no need for the complexity of a dynamic column name (I mean your df[fltr_field + "_matched"]) since the final column in your desired output is called "Role" anyways.

Update to match various roles as whole words
With alternative input,

filtering roles "Alpha" and "Role B",
in presence of other roles such as "Role A", "Role AB", "Aphal", "Alpha A"*

df = pd.DataFrame({'Name': {0: 'Phil', 1: 'Bob', 2: 'Arthur', 3: 'Jane', 4: 'Mary', 5: 'Liz', 6: 'Phoebe', 7: 'Mike', 8: 'Rick', 9: 'Hilary'}, 
                   'Role': {0: 'Role A | Role B', 1: 'Role A | Role B', 2: 'Role A | Role C', 3: 'Role B | Role C', 4: 'Role A | Role D', 5: 'Role BA | Role F', 6: 'Role C | Role D', 7: 'Aphal', 8: 'Alpha B | Role E', 9: 'Alpha'},
                   'Login': {0: '2024/01/01', 1: '2024/02/01', 2: '2024/01/04', 3: '2024/01/31', 4: '2024/02/12', 5: '2024/02/21', 6: '2023/11/21', 7: '2024/02/15', 8: '2024/01/13', 9: '2024/01/11'}}

     Name              Role       Login
0    Phil   Role A | Role B  2024/01/01
1     Bob   Role A | Role B  2024/02/01
2  Arthur   Role A | Role C  2024/01/04
3    Jane   Role B | Role C  2024/01/31
4    Mary   Role A | Role D  2024/02/12
5     Liz  Role BA | Role F  2024/02/21
6  Phoebe   Role C | Role D  2023/11/21
7    Mike             Aphal  2024/02/15
8    Rick  Alpha B | Role E  2024/01/13
9  Hilary             Alpha  2024/01/11

# simple variant for whole words - caution: will mistake a "Role BA" for a "Role B":
role_pattern = "(Alpha|Role B)"
df.assign(Role = df['Role'].str.extract(pat=role_pattern)
 ).dropna(subset = ['Role'])

# Safe variant for whole words that prevents this: (with lookahead assertion)
role_pattern = '(Alpha(?=($|\s|))|Role B(?=($|\s|)))'
df.assign(Role = df['Role'].str.extract(pat=role_pattern)[0]
 ).dropna(subset = ['Role'])

     Name    Role       Login
0    Phil  Role B  2024/01/01
1     Bob  Role B  2024/02/01
3    Jane  Role B  2024/01/31
9  Hilary   Alpha  2024/01/11

The "safe variant" with lookahead assertion will function as long as roles are entered with " | " in between. It makes sure the role name is followed by

either the end of the string ("$"),
or your vertical separator after a space ("\s|").

Similarly, we could not only "look ahead" but also "look back"; you see: regex complexity will have to catch-up with the diversity of expected "roles", but there is no need to make it more complex than it needs to be.

References

TechQA.

Parsing data with Pandas - how to output match as a new column

There are 1 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in REGEX

Related Questions in DATAFRAME

Related Questions in SUBSTITUTION

Popular Questions

Trending Questions