I have the NYC 311 complaint dataset. I want to build a random forest classifier that takes categorical input features about a complaint and predicts the complaint type. These are the input features of a complaint record:
X = df[['Location Type', 'Incident Zip', 'Street Name',
'City', 'Borough', 'Open Data Channel Type']]
All of these features are nominal (categorical) variables, so I need to convert the string values into numeric ones before feeding them to the model. I am reluctant to use one-hot encoding because some features have more than 1000 categories, and the resulting computation might be beyond what my laptop can handle.
I was thinking of using the weight of each category (count of that category / total count) in place of the nominal string values. Would that be a good strategy?
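Replacing a category with its relative frequency is not a good idea, because frequency treated as a continuous variable does not reflect the categories' semantics: unrelated categories with similar counts end up close together, and categories with equal counts become indistinguishable.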
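To make the problem concrete, here is a minimal sketch of the frequency encoding proposed in the question, using pandas. The toy `Borough` values and the `Borough_freq` column name are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for one column of the 311 dataset.
df = pd.DataFrame(
    {"Borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "BROOKLYN"]}
)

# Relative frequency of each category: count of category / total count.
freq = df["Borough"].value_counts(normalize=True)

# Replace every category with its relative frequency.
df["Borough_freq"] = df["Borough"].map(freq)

# Categories that share a frequency collapse onto the same number,
# so the model can no longer tell them apart.
collisions = freq[freq.duplicated(keep=False)]

print(df)
print(collisions)
```

In this toy example BRONX and QUEENS occur equally often, so both are mapped to the same value even though they are different boroughs. With more than 1000 categories in a column, such ties and near-ties are likely.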
One-hot encoding is the way to go; it just requires an additional dimensionality reduction step. Here are some options: