I understand how to use factorize to encode the levels of a factor, such as "L" and "W" (for losses and wins), into numeric values, such as 0 and 1:
import pandas as pd
first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
pd.factorize(first_df['outcome'])
The above returns (array([0, 0, 1, 1]), array(['L', 'W'], dtype=object)).
However, later on, I'd like to combine this result with some other results, where we now have a new outcome, a draw ("D"), and here is where things get sticky:
second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})
pd.factorize(second_df['outcome'])
This returns (array([0, 0, 1, 1]), array(['L', 'D'], dtype=object)), so "D" now gets the same code, 1, that "W" had in the first dataframe.
I need some way to preemptively declare the fact that there are 3 different levels when I create the dataframes, and map the correct numeric value to the correct level. How can I achieve this?
Something like this is definitely possible using a Categorical.
The semantics of Categoricals may not be exactly the same as the output of pd.factorize(), but the codes attribute contains your data as numeric values; the Categorical is simply also aware of the unobserved 'D' value.
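As a minimal sketch of that idea, assuming the three levels are known up front and that the ordering "L", "W", "D" is acceptable (the `levels` list and the variable names below are illustrative choices, not part of the question):

```python
import pandas as pd

# Assumed, fixed list of every level that can ever occur.
# The position in this list becomes the numeric code.
levels = ["L", "W", "D"]

first_df = pd.DataFrame({"outcome": ["L", "L", "W", "W"]})
second_df = pd.DataFrame({"outcome": ["L", "L", "D", "D"]})

# Passing the same categories to both makes the codes consistent,
# even when a level is not observed in one of the frames.
first_cat = pd.Categorical(first_df["outcome"], categories=levels)
second_cat = pd.Categorical(second_df["outcome"], categories=levels)

first_codes = first_cat.codes.tolist()    # [0, 0, 1, 1]
second_codes = second_cat.codes.tolist()  # [0, 0, 2, 2]

print(first_codes, second_codes)
print(list(first_cat.categories))         # ['L', 'W', 'D']
```

Because both Categoricals share the same categories, "D" maps to 2 in either frame and "W" keeps 1. Any value that is not listed in categories is treated as missing and gets the code -1.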