OneHotEncoding transformation interpretation


I'm trying to understand the output of one-hot encoding in Python with scikit-learn. I believe I get the idea of one-hot encoding: convert discrete values into extended feature vectors, with a value of 'on' identifying membership of a category. Perhaps I have that wrong, which may be what is confusing me, but that's my understanding.

So, from the documentation here:

I see the following example:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Could someone please explain how the data [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]] ends up being transformed into [[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]?

How is the transformation argument [0, 1, 1] used?

Many thanks for any help with this



There are 2 answers


So... after further digging, here is my attempt at clarifying one way of understanding this and answering it for others.

1) The original data set is [0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]

2) You then need to reduce these down (by position) to a list of unique ordered values:


For position 1 (0, 1, 0, 1) --> [0, 1]
For position 2 (0, 1, 2, 0) --> [0, 1, 2]
For position 3 (3, 0, 1, 2) --> [0, 1, 2, 3]

Now, when transforming, you simply compare each item in the array being transformed with the list of unique ordered values for its position.

For the array being transformed, [0, 1, 1]:

The first '0' generates [1, 0] ('0' matches the first value in [0, 1], not the second)
The next '1' generates [0, 1, 0] ('1' matches only the second value in [0, 1, 2])
The last '1' generates [0, 1, 0, 0] ('1' matches only the second value in [0, 1, 2, 3])

Put together, this gives [1, 0, 0, 1, 0, 0, 1, 0, 0].

I've tried this with a number of other data sets, and the logic is consistent.
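
The steps above can be sketched in plain Python, without scikit-learn. This only illustrates the logic described in this answer and is not how the library is implemented; the helper names fit_categories and one_hot are made up for the example.

```python
def fit_categories(rows):
    """Collect the sorted unique values seen at each position."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot(row, categories):
    """Build one 0/1 block per position and concatenate the blocks."""
    encoded = []
    for value, seen in zip(row, categories):
        # A 1 marks the slot whose value matches; every other slot is 0.
        encoded.extend(1 if value == candidate else 0 for candidate in seen)
    return encoded


data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(data)
encoded = one_hot([0, 1, 1], categories)

print(categories)  # per-position unique values
print(encoded)     # concatenated one-hot blocks
```

Note that this sketch silently produces an all-zero block for a value that was never seen during fitting, whereas the encoder in the question is configured with handle_unknown='error'.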

silviomoreto On

One-hot encoding is meant for categorical features, where the numbers carry no spatial relation; they are not continuous. So if a feature has the value 1, that does not mean it is closer to 2 than to 3.

To avoid implying such an ordering, we create a binary column for each value that a feature can take. One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
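
A small illustration of the point about spacing, in plain Python (the helper names are made up for the example): with raw integer codes, category 1 looks closer to 2 than to 3, while the one-hot vectors are all equally far apart.

```python
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_hot_value(value, values):
    """Encode a single value as a 0/1 vector over the known values."""
    return [1 if value == v else 0 for v in values]


values = [1, 2, 3]

# Distances between the raw integer codes: they depend on the labels chosen.
raw_distances = {(a, b): (a - b) ** 2 for a, b in combinations(values, 2)}

# Distances between the one-hot vectors: identical for every pair.
hot_distances = {
    (a, b): squared_distance(one_hot_value(a, values), one_hot_value(b, values))
    for a, b in combinations(values, 2)
}

print(raw_distances)
print(hot_distances)
```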

So, in your example, note that what you are transforming is the array: [0, 1, 1].

Remember that the transformation turns each value in this array into a binary block, according to the values seen during fit, resulting in the array: [ 1., 0., 0., 1., 0., 0., 1., 0., 0.]

The first feature can have 2 values, the second 3, and the third 4. This matches enc.n_values_, array([2, 3, 4]); the data passed to fit contains all of 0, 1, 2 and 3 in the third position.

So, the first two elements encode the first feature, the next three encode the second feature and the last four encode the third feature.
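
To see which elements belong to which feature, the encoded row can be cut at the offsets that enc.feature_indices_ reports in the question ([0, 2, 5, 9]). A minimal sketch in plain Python; the variable names are illustrative, and the offsets are recomputed here from the counts [2, 3, 4] rather than read from scikit-learn.

```python
from itertools import accumulate

n_values = [2, 3, 4]                        # values per feature, as in enc.n_values_
offsets = [0] + list(accumulate(n_values))  # running totals give the block boundaries

encoded = [1, 0, 0, 1, 0, 0, 1, 0, 0]

# Slice the flat vector into one block per feature.
blocks = [encoded[start:end] for start, end in zip(offsets, offsets[1:])]

# The position of the 1 inside each block recovers the original value,
# because in this example the values of each feature are 0, 1, 2, ...
decoded = [block.index(1) for block in blocks]

print(offsets)
print(blocks)
print(decoded)
```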