OneHotEncoding transformation interpretation


I'm trying to understand the output of one-hot encoding in Python with scikit-learn. I believe I get the idea of one-hot encoding, i.e., converting discrete values into extended feature vectors with a value of 'on' to identify membership of a category. Perhaps I got that wrong, which is what's confusing me, but that's my understanding.

So, from the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

I see the following example:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Could someone please explain how the data [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]] ends up being transformed into [[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]?

How is the transformation argument [0, 1, 1] used?

Many thanks for any help with this

Jon


There are 2 answers

Jon M On BEST ANSWER

So... after further digging, here is my attempt at clarifying one way of understanding this and answering it for others.

1) The original data set is [0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]

2) You then need to reduce these down (by position) to a list of unique ordered values:

So...

For position 1 (0, 1, 0, 1) --> [0, 1]
For position 2 (0, 1, 2, 0) --> [0, 1, 2]
For position 3 (3, 0, 1, 2) --> [0, 1, 2, 3]

Now, when transforming, you simply compare each item in the array passed to transform against the list of unique ordered values for its position, putting a 1 where it matches and a 0 everywhere else.

For the array being transformed, [0, 1, 1]:

The first '0' generates a [1, 0] ('0' matches the value in position one of [0, 1], not position two)
The next '1' generates a [0, 1, 0] ('1' only matches the value in position two of [0, 1, 2])
The last '1' generates a [0, 1, 0, 0] ('1' only matches the value in position two of [0, 1, 2, 3])

Put together, this equates to a [1, 0, 0, 1, 0, 0, 1, 0, 0].

I've tried this with a number of other data sets, and the logic is consistent.
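The steps above can be sketched in plain Python. This is only an illustration of the logic described in this answer, not scikit-learn's actual implementation; the helper names fit_categories and one_hot are made up for the example.

```python
def fit_categories(rows):
    """Collect the sorted unique values seen at each position (column)."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot(row, categories):
    """Encode one row: one block per position, with a 1 at the matching value."""
    encoded = []
    for value, values_seen in zip(row, categories):
        encoded.extend(1.0 if value == v else 0.0 for v in values_seen)
    return encoded


# The data passed to fit() in the question.
training = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(training)
print(categories)                      # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

# The row passed to transform() in the question.
print(one_hot([0, 1, 1], categories))  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Note that the training data only determines the list of values per position; the output vector is built entirely from the row given to transform.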

silviomoreto On

One-hot encoding is meant for categorical features, where the numbers have no ordering or distance between them; they are not continuous. So if a feature has value 1, that does not mean it is closer to 2 than to 3.

To avoid implying such an ordering, we create a binary column for each value that a feature can take. One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

So, in your example, note that what you are transforming is the array: [0, 1, 1].

The transformation replaces each value in this array with its binary encoding, resulting in the array: [ 1., 0., 0., 1., 0., 0., 1., 0., 0.]

The first feature can take 2 values, the second 3 and the third 4, which is what n_values_ reports as [2, 3, 4] (fit saw 0 and 1 in the first column; 0, 1 and 2 in the second; and 0, 1, 2 and 3 in the third).

So, the first two elements encode the first feature, the next three encode the second feature and the last four encode the third feature.
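As a rough sketch, the feature_indices_ array from the question, [0, 2, 5, 9], gives the boundaries of each feature's block in the output, so the encoded row can be cut back into per-feature pieces. The variable names below are just for illustration, and reading the original value back with index() only works here because each feature's values happen to be 0, 1, 2, ... in order.

```python
# Block boundaries reported by enc.feature_indices_ in the question.
feature_indices = [0, 2, 5, 9]

# The encoded row returned by enc.transform([[0, 1, 1]]).toarray().
encoded = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Slice the row into one block per feature: [0:2], [2:5], [5:9].
blocks = [
    encoded[start:stop]
    for start, stop in zip(feature_indices, feature_indices[1:])
]
print(blocks)   # [[1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# The position of the single 1 in each block recovers the original value.
decoded = [block.index(1.0) for block in blocks]
print(decoded)  # [0, 1, 1]
```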