One-hot encoding with pandas.get_dummies: how to read the output


I am reading the pandas documentation to understand pandas.get_dummies.

>>> import pandas as pd
>>> l = list('abca')
>>> print(l)
['a', 'b', 'c', 'a']
>>> s = pd.Series(l)
>>> print(s)
0    a
1    b
2    c
3    a

I have created a Series as shown above.

When I call get_dummies on this Series, the output is as below:

>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

I could not understand what this output means.

Can we say the new values of the entries are as below?

a --> 100
b --> 010
c --> 001
a --> 100

Also, are they decimal or binary?


There are 2 answers

piRSquared (BEST ANSWER)

Dummy variables are binary features: a single column that says whether each row is or isn't some particular thing. When an existing column holds more than one distinct value, we can split it into one column per unique value. Each new column holds a one where the row had that unique value and a zero where it did not.

Since each row of `s` holds only one value, each row of zeros and ones contains exactly one 1, under the column header that matches the value of the corresponding row in `s`:

   a  b  c
0  1  0  0  # 1 is under `a` which was the value in `s` for this row.
1  0  1  0
2  0  0  1
3  1  0  0

Put another way, think of the new `a` column as telling you where the `a`s were in `s`.
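
The "column as indicator" reading can be checked directly. The following is a minimal sketch, not part of the original answer: it assumes a pandas version in which get_dummies accepts the dtype argument, and it passes dtype=int explicitly because recent pandas releases return True/False columns by default instead of the 1/0 shown above. The names dummies, a_indicator and same are only illustrative.

```python
import pandas as pd

s = pd.Series(list("abca"))

# Ask for 0/1 integers explicitly; recent pandas versions default to booleans.
dummies = pd.get_dummies(s, dtype=int)

# Build the indicator for "a" by hand: 1 where s is "a", 0 elsewhere.
a_indicator = (s == "a").astype(int)

# The dummy column `a` should match the hand-built indicator row by row.
same = (dummies["a"] == a_indicator).all()
print(same)
```

If the comparison holds, the `a` column is nothing more than the answer to "is this row an `a`?" written as a number.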

FabienP

This is one-hot encoding.

   a  b  c
0  1  0  0  <-- a, not b, not c in row 0
1  0  1  0  <-- not a, b, not c in row 1 
2  0  0  1  <-- not a, not b, c in row 2
3  1  0  0  <-- a, not b, not c in row 3

Consider reading this for another example.
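
On the question of whether 100, 010 and 001 are decimal or binary: each digit lives in its own column, so a row is better read as three separate 0/1 flags than as one number. A small sketch of that reading follows. It is an illustration added here rather than part of the answer, it assumes get_dummies accepts dtype=int, and the names row_sums and recovered are made up for the example.

```python
import pandas as pd

s = pd.Series(list("abca"))
dummies = pd.get_dummies(s, dtype=int)

# One-hot means exactly one flag is set per row, so every row sums to 1.
row_sums = dummies.sum(axis=1)

# The label of the column holding the 1 recovers the original value.
recovered = dummies.idxmax(axis=1)

print(row_sums.tolist())
print(recovered.tolist())
```

Because the original Series can be rebuilt from the column labels alone, the encoding carries the same information as `s`, just spread over one column per unique value.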