Delete column that contains string in DataFrameGroupBy object so I can Standardize data

50 views Asked by At

I have a dataframe that i want to group by a column that contains text. After I group the dataframe by that certain column i want to standarize only the numerical columns. Let's say i have the following dataframe:

import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({"cost": [30,15,100,65,75,55,29,45], 
"sales":[80,88,70,80,999,70,8,95], 
"da_value":["low","low","high","medium","high","medium","low","medium"]
"names": ["Jo","Andrew","AI","Michael","Nikola","Jim","Bojan","Vurce"]})

 df #to see my dataframe

I want to group by "da_value" so i do this

df_dast=df.groupby("da_value")
df_dast=scaler.fit_transform(df_dast)

And i get the following error:

ValueError: could not convert string to float: 'high'

Also after the groupby, df_dast is '<pandas.core.groupby.generic.DataFrameGroupBy object'. Is there a way to delete the column 'da_value' after i do the groupby so i can standarize the numerical columns? I am not even sure I standardize a DataFrameGroupBy object as I tried to do groupby the 'sales' column (after I dropped the 'da_value' column) and then startardize and I got this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (6, 2) + inhomogeneous part.

I though about doing something like this

high=df_dast.get_group('high').drop(columns=['da_value'])

Which returns a dataframe ,without the column that contains text, that I can standarize, but I think it beats the purpose of using groupby.

Thank you

2

There are 2 answers

1
nisakova On

i think you cannot directly scale non-numeric (categorical) data using StandardScaler method, maybe applying One-Hot Encoding to the 'da_value' column

df_encoded = pd.get_dummies(df, columns=['da_value'])

df_encoded_scaled = pd.DataFrame(scaler.fit_transform(df_encoded), columns=df_encoded.columns)

0
user3430128 On

I think i found a way around it

df.groupby("da_value").transform(lambda x: (x -x.mean())/x.std() if x.dtype != 'object' else x.values)

This is what i get:

enter image description here