Reducing the Sparsity of a One-Hot Encoded dataset


I'm trying out some feature selection algorithms on the UCI Adult data set and I'm running into a problem with univariate feature selection. I'm one-hot encoding all the categorical columns to turn them into numerical data, but that gives me a separate F score for every dummy column instead of one score per original feature.

How can I avoid this? What should I do to make this code better?

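import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Encode the two binary columns as 0/1 (assumes `adult` is already loaded as a DataFrame)
adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)

adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)

# One-hot encode: one dummy column per race category
adult = pd.get_dummies(adult, columns=["race"])

# Separate the target from the features
target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)

# Score every column with the ANOVA F-test and keep the 5 best
selector = SelectKBest(f_classif, k=5)
selector.fit(data, target)

for n, s in zip(data.columns, selector.scores_):
    print("F Score", s, "for feature", n)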

EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White

Expected Results:
F Score "f_score" for feature "race"

One-hot encoding splits the feature above into many sub-features, whereas I would like a single score for race as a whole (see Expected Results), if that is possible.

Little Bobby Tables (best answer)

One way to reduce the number of features, while still encoding your categories in a non-ordinal manner, is binary encoding. One-hot encoding grows linearly: a categorical feature with n categories produces n columns. Binary encoding grows as log_2(n), needing roughly ceil(log_2(n)) columns. In other words, doubling the number of categories adds a single column under binary encoding, whereas it doubles the number of columns under one-hot encoding.
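To put numbers on that growth rate, here is a minimal sketch that counts columns for both schemes. The helper names `one_hot_columns` and `binary_columns` are made up for illustration, and the count assumes category codes start at 0; a library that reserves an extra code (for example for unknown values) may produce one more column than this.

```python
def one_hot_columns(n_categories):
    # One-hot encoding uses one column per category.
    return n_categories

def binary_columns(n_categories):
    # Bits needed to give every category a distinct code, i.e. ceil(log2(n)).
    # Assumes codes run from 0 to n_categories - 1; always at least one column.
    return max(1, (n_categories - 1).bit_length())

# Print the column counts side by side for a few category counts.
for n in (4, 8, 16, 32, 1024):
    print(n, "categories:", one_hot_columns(n), "one-hot columns,",
          binary_columns(n), "binary columns")
```

Each doubling of the category count doubles the one-hot column count but adds only one binary column.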

Binary encoding can be easily implemented in Python using the category_encoders package. The package is pip installable and works seamlessly with sklearn and pandas. Here is an example:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})

enc_bin = ce.BinaryEncoder(cols=['cat1']) # with cols=None, all string columns are encoded

df_trans = enc_bin.fit_transform(df)
print(df_trans)


Out[1]:
           cat1_0  cat1_1 cat2
    0       1       1      C
    1       0       1      S
    2       1       0      T
    3       0       0      B

Here is the code from a previous answer of mine, using the same variables as above but with one-hot encoding. Let's compare the two outputs.

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1']) # cols=None, all string columns encoded

df_trans = enc_ohe.fit_transform(df)
print(df_trans)


Out[2]:
       cat1_0  cat1_1  cat1_2  cat1_3 cat2
    0       0       0       1       0    C
    1       0       0       0       1    S
    2       1       0       0       0    T
    3       0       1       0       0    B

See how binary encoding uses half as many columns here (2 instead of 4) to uniquely describe each category within the feature cat1.
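To make the mechanism concrete, here is a hand-rolled sketch of binary encoding in plain Python. It is not how category_encoders implements it: the function name `binary_encode` is hypothetical, and the integer code given to each category (order of first appearance, starting at 0) is an assumption, so the bit patterns will not match the library output above. The point is only that each category gets a distinct bit pattern in two columns.

```python
def binary_encode(values):
    # Assign each distinct category an integer code in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    # Number of bits needed to represent the largest code (at least one).
    width = max(1, (len(codes) - 1).bit_length())
    # Write each code as a fixed-width list of bits, most significant bit first.
    return [[(codes[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]

cat1 = ['A', 'N', 'K', 'P']
rows = binary_encode(cat1)
for value, bits in zip(cat1, rows):
    print(value, bits)
```

Four categories fit in two bit columns, and every row is distinct, which is all a downstream model needs to tell the categories apart.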