Assigning a value to a column based on a mapping defined in a dictionary

I am trying to write code that reads a CSV file, creates a DataFrame from it, and then tags each row with the name of a dictionary key if one of the columns in that row contains a string listed under that key.

As an example, I have the following dictionary defined:

Sdiction={
        "Mgage" : ["ABC Gage","XYZ Gage"],
        "Rate" : ["deg/min","rad/s","rpm"]}

And I have the following dataframe:

Col A  Col B  Col C     Col D
1      30     ABC Gage
2      45     deg/min
3      150    Gage
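
For anyone who wants to reproduce the example, the dictionary and frame above could be set up roughly as follows. This is only a sketch: Col D is left out because it is the column to be filled in, and the CSV file name mentioned in the comment is a hypothetical placeholder.

```python
import pandas as pd

# Dictionary from the question: tag name -> strings that belong to it.
Sdiction = {
    "Mgage": ["ABC Gage", "XYZ Gage"],
    "Rate": ["deg/min", "rad/s", "rpm"],
}

# Example frame from the question. Col D is omitted because it is the
# column to be computed. With a real file this would instead be something
# like pd.read_csv("data.csv") (hypothetical file name).
df = pd.DataFrame({
    "Col A": [1, 2, 3],
    "Col B": [30, 45, 150],
    "Col C": ["ABC Gage", "deg/min", "Gage"],
})
```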

I want to tag Col D for each row as follows:

Row 1 - Col D = Mgage (since ABC Gage is listed under the key Mgage)

Row 2 - Col D = Rate (since deg/min is listed under the key Rate)

Row 3 - Col D = Mgage (since the string Gage appears, albeit only partially, in an entry under the key Mgage)

Expected output:

Col A  Col B  Col C     Col D
1      30     ABC Gage  Mgage
2      45     deg/min   Rate
3      150    Gage      Mgage

I have not implemented this part yet and am trying to figure out how to do it, so any help is appreciated.

There are 2 answers

mozway (best answer)

Using a regex match:

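import re
import pandas as pd

# df and Sdiction are the DataFrame and dictionary from the question

# case-insensitive copy of the column to tag
s = df['Col C'].str.casefold()

# alternation of all (escaped) values found in Col C
pattern = '(%s)' % '|'.join(map(re.escape, s))
# '(abc\\ gage|deg/min|gage)'

# reverse dictionary: one row per string, with its key in 'ref'
tmp = pd.Series({v.casefold(): k for k, values in Sdiction.items()
                 for v in values}, name='ref').reset_index()

# extract first match in each dictionary string, map reference key
matches = tmp['index'].str.extract(pattern, expand=False)
df['Col D'] = s.map(tmp.assign(match=matches)
                       .dropna(subset=['match'])
                       .set_index('match')['ref']
                    )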

Output:

   Col A  Col B     Col C  Col D
0      1     30  ABC Gage  Mgage
1      2     45   deg/min   Rate
2      3    150      Gage  Mgage
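
To make the direction of the match explicit: the pattern is built from the values in Col C and searched for inside the dictionary strings, so a cell such as Gage is tagged because it occurs inside ABC Gage. A rough plain-Python sketch of the same idea follows, assuming case-insensitive substring matching where the first matching key wins. `tag_value` is a hypothetical helper written for illustration, not part of the answer above.

```python
def tag_value(value, mapping):
    """Return the first key whose list has an entry containing `value`."""
    needle = value.casefold()
    for key, entries in mapping.items():
        # Substring test: the cell value must occur inside a dictionary entry
        if any(needle in entry.casefold() for entry in entries):
            return key
    return None  # no entry contains the value


Sdiction = {
    "Mgage": ["ABC Gage", "XYZ Gage"],
    "Rate": ["deg/min", "rad/s", "rpm"],
}

tags = [tag_value(v, Sdiction) for v in ["ABC Gage", "deg/min", "Gage"]]
print(tags)  # ['Mgage', 'Rate', 'Mgage']
```

On a frame this could be applied with df['Col C'].map(lambda v: tag_value(v, Sdiction)), at the cost of a Python-level loop over the rows.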

Soudipta Dutta

NumPy's vectorized np.isin, together with the pandas .str.lower accessor, works on whole columns at once, which typically scales better on large arrays than an explicit Python loop.

This method uses slightly more memory because it builds a lookup table, but this is usually offset by the faster execution. Note that it only tags exact (case-insensitive) matches; a partial value such as Gage keeps its original text.

import pandas as pd
import numpy as np

# Named Sdiction as in the question; calling it `dict` would shadow the built-in
Sdiction = {
    "Mgage": ["ABC Gage", "XYZ Gage"],
    "Rate": ["deg/min", "rad/s", "rpm"]
}
df = pd.DataFrame({
    "Col A": [1, 2, 3, 4, 5],
    "Col B": [30, 45, 150, 70, 60],
    "Col C": ["ABC Gage", "deg/min", "Gage", "rad/s", "rpm"]
})

# Reverse lookup: lowercased string -> dictionary key
lookup_table = {v.lower(): k for k, values in Sdiction.items() for v in values}
# print(lookup_table)
# {'abc gage': 'Mgage', 'xyz gage': 'Mgage', 'deg/min': 'Rate', 'rad/s': 'Rate', 'rpm': 'Rate'}

df['Col_C_lower'] = df['Col C'].str.lower()

# Vectorized exact matching using NumPy
matches = np.isin(df['Col_C_lower'].to_numpy(), list(lookup_table.keys()))
# print(matches)
# [ True  True False  True  True]

# Map matches to dictionary keys using the lookup table;
# rows without an exact match keep their original Col C value
df['Col_Matches'] = df['Col_C_lower'].map(lookup_table).where(matches, df['Col C'])

# Drop the temporary column, optional
# df.drop('Col_C_lower', axis=1, inplace=True)
# print(df)
#    Col A  Col B     Col C Col_C_lower   Col_Matches
# 0      1     30  ABC Gage    abc gage       Mgage
# 1      2     45   deg/min     deg/min        Rate
# 2      3    150      Gage        gage        Gage
# 3      4     70     rad/s       rad/s        Rate
# 4      5     60       rpm         rpm        Rate
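
One observation on the code above: Series.map with a dict already yields NaN for values that are not keys, so the np.isin mask carries the same information as notna() on the mapped result. A minimal sketch of that shortcut, assuming the same lookup table and a shortened column; the variable names are illustrative only.

```python
import pandas as pd

lookup_table = {"abc gage": "Mgage", "xyz gage": "Mgage",
                "deg/min": "Rate", "rad/s": "Rate", "rpm": "Rate"}
col_c = pd.Series(["ABC Gage", "deg/min", "Gage"])

# map() leaves NaN wherever the lowercased value is not a key
mapped = col_c.str.lower().map(lookup_table)

# Fall back to the original value for rows without an exact match
result = mapped.fillna(col_c)
print(result.tolist())  # ['Mgage', 'Rate', 'Gage']
```

Either way, only exact matches are tagged, so the partial value Gage from the question stays as it is.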