I'm trying to understand the output of the one-hot encoding process in Python with scikit-learn. I believe I get the idea of one-hot encoding: convert each discrete value into an extended feature vector in which a single 'on' entry identifies the category that value belongs to. Perhaps I have that wrong, which may be why I'm confused, but that's my understanding.
So, from the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
I see the following example:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Could someone please explain how the data [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]] ends up being transformed into [[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]?
How is the transformation argument [0, 1, 1] used?
Many thanks for any help with this
Jon
So... after further digging, here is my attempt at clarifying one way of understanding this and answering it for others.
1) The original data set is [0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]
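2) You then need to reduce these down (by position) to a list of unique ordered values:
So...
Position 1: 0, 1, 0, 1 reduces to [0, 1]
Position 2: 0, 1, 2, 0 reduces to [0, 1, 2]
Position 3: 3, 0, 1, 2 reduces to [0, 1, 2, 3]
This is why n_values_ is [2, 3, 4], and why the encoded vector has 2 + 3 + 4 = 9 entries.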
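3) Now, when transforming, you simply compare each positional item in the array being transformed to the list of unique ordered values for that position, and switch 'on' the slot where it matches.
For the array being transformed, [0, 1, 1]:
Position 1: value 0 matched against [0, 1] gives [1, 0]
Position 2: value 1 matched against [0, 1, 2] gives [0, 1, 0]
Position 3: value 1 matched against [0, 1, 2, 3] gives [0, 1, 0, 0]
Put together, this equates to [1, 0, 0, 1, 0, 0, 1, 0, 0].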
I've tried this with a number of other data sets, and the logic is consistent.
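For anyone who wants to check the logic by hand, here is a minimal pure-Python sketch of the steps above. It does not use scikit-learn, and it is only my reading of the behaviour, not the library's actual implementation. The helper names fit_categories and one_hot are made up for illustration. It also assumes every value being transformed was seen during fitting, and that the values in each column are consecutive integers starting at 0, which is the case in the documentation example.

```python
from itertools import accumulate


def fit_categories(rows):
    """Hypothetical helper: sorted unique values seen in each column."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot(row, categories):
    """Hypothetical helper: encode one row against the fitted categories."""
    encoded = []
    for value, values in zip(row, categories):
        block = [0.0] * len(values)
        # Switch 'on' the slot matching this value.
        # Raises ValueError if the value was never seen during fitting.
        block[values.index(value)] = 1.0
        encoded.extend(block)
    return encoded


training = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(training)

# Number of distinct values per column, and where each column's block starts.
n_values = [len(values) for values in categories]
feature_indices = [0] + list(accumulate(n_values))

result = one_hot([0, 1, 1], categories)

print(categories)       # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
print(n_values)         # [2, 3, 4]
print(feature_indices)  # [0, 2, 5, 9]
print(result)           # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The printed n_values and feature_indices line up with enc.n_values_ and enc.feature_indices_ from the documentation example: each entry of feature_indices is the offset at which one column's block begins in the encoded vector, and the last entry (9) is the total length.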