Nominal valued dataset in machine learning


What is the best way to include nominal values, as opposed to real or boolean ones, in a feature vector for machine learning?

Should I map each nominal value to a real value?

For example, suppose I want my program to learn a predictive model for the users of a web service, whose input features may include

{ gender(boolean), age(real), job(nominal) }

where the dependent variable may be the number of website logins.

The variable job may be one of

{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.

Should I map PROGRAMMER to 0, ARTIST to 1, and so on?


There are 2 answers

2
Has QUIT--Anony-Mousse On BEST ANSWER

Do a one-hot encoding, if anything.

If your data has categorical attributes, it is recommended to use an algorithm that can deal with such data well without the hack of encoding, e.g., decision trees and random forests.
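For the feature vector in the question, a one-hot encoding could look like the minimal sketch below. The helper names `one_hot` and `encode_user`, and the assumption that `JOBS` lists every possible job, are illustrative only and not taken from any library.

```python
# Assumed complete list of job categories; in practice, collect it from the training data.
JOBS = ["PROGRAMMER", "ARTIST", "CIVIL SERVANT"]


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_user(gender, age, job):
    """Build a numeric feature vector from { gender(boolean), age(real), job(nominal) }."""
    return [1.0 if gender else 0.0, float(age)] + one_hot(job, JOBS)


vec = encode_user(True, 31, "ARTIST")
print(vec)
```

Each job gets its own 0/1 column, so the model never sees an artificial ordering such as PROGRAMMER < ARTIST.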

2
Calia Kim On

In the book "Machine Learning with Spark", the author writes:


Categorical features

Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.

:

To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding. An approach such as 1-of-k encoding is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables.

:
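A small sketch of why the quoted passage treats nominal variables differently from ordinal ones: mapping categories to consecutive integers makes some pairs look closer than others, while 1-of-k encoding keeps every pair equally far apart. The category list and function names here are assumptions made for illustration.

```python
import math

# Example category values; the order is arbitrary, which is exactly the problem.
categories = ["student", "programmer", "artist"]

# Integer mapping: student -> 0.0, programmer -> 1.0, artist -> 2.0
int_code = {c: float(i) for i, c in enumerate(categories)}

# 1-of-k mapping: one indicator position per category
one_of_k = {
    c: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, c in enumerate(categories)
}


def int_distance(a, b):
    """Distance between two categories under the integer mapping."""
    return abs(int_code[a] - int_code[b])


def one_of_k_distance(a, b):
    """Euclidean distance between two categories under 1-of-k encoding."""
    return math.dist(one_of_k[a], one_of_k[b])


for a, b in [("student", "programmer"), ("student", "artist")]:
    print(a, b, int_distance(a, b), one_of_k_distance(a, b))
```

Under the integer mapping, "student" ends up twice as far from "artist" as from "programmer", purely because of the order in which the categories were listed. Under 1-of-k encoding both distances are the same.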


I had exactly the same thought.

I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I can also use learning algorithms that only take numerical vectors.

In fact, I have done some projects where I had to do it that way, and no issues were raised about the performance of the learning system.
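One possible example of such a transformation, not necessarily the one used in those projects, is to replace each category with the mean of the dependent variable observed for it (often called mean or target encoding). The function name `fit_mean_encoding` and the login counts below are made up for illustration.

```python
from collections import defaultdict


def fit_mean_encoding(jobs, logins):
    """Map each job to the average number of logins observed for that job."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for job, n in zip(jobs, logins):
        totals[job] += n
        counts[job] += 1
    return {job: totals[job] / counts[job] for job in totals}


# Made-up training data, for illustration only.
jobs = ["PROGRAMMER", "ARTIST", "PROGRAMMER", "CIVIL SERVANT"]
logins = [10, 4, 20, 6]

job_to_real = fit_mean_encoding(jobs, logins)
print(job_to_real)
```

The mapping should be computed on the training data only and then reused on new data, otherwise information about the dependent variable leaks into the features.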

To someone who took a vote against my question, please cancel your evaluation.