I have a dataframe that i want to group by a column that contains text. After I group the dataframe by that certain column i want to standarize only the numerical columns. Let's say i have the following dataframe:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({"cost": [30,15,100,65,75,55,29,45],
"sales":[80,88,70,80,999,70,8,95],
"da_value":["low","low","high","medium","high","medium","low","medium"]
"names": ["Jo","Andrew","AI","Michael","Nikola","Jim","Bojan","Vurce"]})
df #to see my dataframe
I want to group by "da_value" so i do this
df_dast=df.groupby("da_value")
df_dast=scaler.fit_transform(df_dast)
And i get the following error:
ValueError: could not convert string to float: 'high'
Also after the groupby, df_dast is '<pandas.core.groupby.generic.DataFrameGroupBy object'. Is there a way to delete the column 'da_value' after i do the groupby so i can standarize the numerical columns? I am not even sure I standardize a DataFrameGroupBy object as I tried to do groupby the 'sales' column (after I dropped the 'da_value' column) and then startardize and I got this error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (6, 2) + inhomogeneous part.
I though about doing something like this
high=df_dast.get_group('high').drop(columns=['da_value'])
Which returns a dataframe ,without the column that contains text, that I can standarize, but I think it beats the purpose of using groupby.
Thank you
i think you cannot directly scale non-numeric (categorical) data using StandardScaler method, maybe applying One-Hot Encoding to the 'da_value' column
df_encoded = pd.get_dummies(df, columns=['da_value'])
df_encoded_scaled = pd.DataFrame(scaler.fit_transform(df_encoded), columns=df_encoded.columns)