I understand how to use factorize to encode the levels of a factor, such as "L" and "W" (for wins and losses), into numeric values, such as "0" and "1":
import pandas as pd
first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
pd.factorize(first_df['outcome'])
The above returns (array([0, 0, 1, 1]), array(['L', 'W'], dtype=object)).
However, later on, I'd like to combine this result with some other results, where we now have a new outcome, a draw ("D"), and here is where things get sticky:
second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})
pd.factorize(second_df['outcome'])
This returns (array([0, 0, 1, 1]), array(['L', 'D'], dtype=object))
I need some way to preemptively declare the fact that there are 3 different levels when I create the dataframes, and map the correct numeric value to the correct level. How can I achieve this?
Something like this is definitely possible using a Categorical.

The semantics of Categoricals may not be exactly the same as the output of pd.factorize(), but the codes attribute contains your data as numeric values; it's just that the Categorical is also aware of the unobserved 'D' value.
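A minimal sketch of the idea, reusing the two dataframes from the question. The name levels and the ordering "L", "W", "D" are assumptions made for illustration; whatever order you pass as categories determines which integer each level gets.

```python
import pandas as pd

# Declare every possible level up front. The position in this list
# becomes the integer code (ordering here is an assumption).
levels = ["L", "W", "D"]

first_df = pd.DataFrame({"outcome": ["L", "L", "W", "W"]})
second_df = pd.DataFrame({"outcome": ["L", "L", "D", "D"]})

# Both Categoricals share the same categories, so their codes are comparable.
first_cat = pd.Categorical(first_df["outcome"], categories=levels)
second_cat = pd.Categorical(second_df["outcome"], categories=levels)

first_codes = first_cat.codes    # "L" -> 0, "W" -> 1
second_codes = second_cat.codes  # "L" -> 0, "D" -> 2

print(first_codes, second_codes)
print(first_cat.categories)  # still lists 'D', even though it is unobserved here
```

With this approach "D" maps to 2 in both frames, whether or not it appears in a given one. Note that a value not listed in categories is treated as missing and gets the code -1.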