Label Encoding using weights for string nominal variables for random forest classification


I have the NYC 311 complaint dataset. I want to build a random forest classifier that takes categorical input features about a complaint and predicts the complaint type. The following are the input features of a given complaint record:

X = df[['Location Type', 'Incident Zip', 'Street Name', 
'City', 'Borough', 'Open Data Channel Type']]

All of these features are nominal (categorical) variables, so I will need to convert the strings into numeric values before feeding them to the model. I am reluctant to use one-hot encoding, since some features have more than 1000 categories and the resulting computation might be out of reach for my laptop.

I was thinking of using the relative frequency of each category (count of the particular category / total count) instead of the nominal string values. Would that be a good strategy?
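In code, the idea would look something like this (a sketch; frequency_encode is just an illustrative name, and df is the DataFrame above):

import pandas as pd

def frequency_encode(df, columns):
    # Replace each category with its relative frequency: count / total.
    out = df.copy()
    for col in columns:
        freqs = out[col].value_counts(normalize=True)
        out[col] = out[col].map(freqs)
    return out

X_encoded = frequency_encode(X, X.columns)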


There are 2 answers

Ion Lesan:

Replacing a category with its relative frequency is not a good idea, because the frequency, as a continuous variable, does not reflect the category's semantics.

One-hot encoding is the way to go; it will just require an additional dimensionality reduction step. Here are some options:

  1. For each categorical variable, do compute the category frequencies, but use them to discard low-frequency values, i.e. after one-hot encoding keep only the columns that correspond to the top N values (see the sketch after this list). You may be tempted to think that every value with only a few occurrences will contribute to predictive power, but it will not: statistically, a category that occurs only a few times is not sufficient to generalize from and serve as a predictor.
  2. You can try applying PCA to each one-hot encoded variable. However, in theory PCA requires normally distributed variables, so using it on binary variables is debatable.
  3. Instead of one-hot encoding, try encoding with something like char2vec (e.g. using https://pypi.org/project/gensim/). The advantage is that it will create similar vectors for alternative forms or misspellings of your categories, e.g. the vectors for 'color' and 'colour' will be very close to each other, which is what you want.
  4. Do some additional research for other dimensionality reduction techniques.
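A minimal sketch of option 1, assuming pandas; the helper name and the TOP_N cutoff are illustrative, not part of any library:

import pandas as pd

TOP_N = 30  # illustrative cutoff; tune per column

def one_hot_top_n(df, columns, top_n=TOP_N):
    # One-hot encode each column, keeping only its top_n most frequent values
    # and collapsing everything else into a single '__other__' bucket.
    out = pd.DataFrame(index=df.index)
    for col in columns:
        top = df[col].value_counts().nlargest(top_n).index
        reduced = df[col].where(df[col].isin(top), other='__other__')
        out = out.join(pd.get_dummies(reduced, prefix=col))
    return out

X_reduced = one_hot_top_n(X, X.columns)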
StupidWolf:

Random forest is an ensemble of decision trees in which you divide your data into subsets based on splits of your variables. Encoding each categorical variable by its frequency is not very sound: it assumes that categories with similar frequencies will behave similarly in predicting the response, and there is nothing in your data to suggest that.

In the case where you have 1000+ categories, it might make more sense to group the rare categories or singletons into one big category like "others" before doing the one-hot encoding, for example as sketched below.
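A minimal sketch of that grouping with pandas; the helper name and the MIN_COUNT threshold are illustrative:

import pandas as pd

MIN_COUNT = 50  # illustrative threshold; anything rarer becomes "others"

def group_rare(series, min_count=MIN_COUNT):
    # Collapse categories with fewer than min_count occurrences into "others".
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other='others')

X_grouped = X.apply(group_rare)   # apply column-wise
X_encoded = pd.get_dummies(X_grouped)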