Setting levels a priori when using factorize in pandas to cover missing cases

I understand how to use factorize to encode the levels of a factor, such as "L" and "W" (for losses and wins), as numeric values, such as 0 and 1:

import pandas as pd
first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
pd.factorize(first_df['outcome'])

The above returns (array([0, 0, 1, 1]), array(['L', 'W'], dtype=object)).

However, later on, I'd like to combine this result with some other results that contain a new outcome, a draw ("D"), and this is where things get sticky:

second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})
pd.factorize(second_df['outcome'])

This returns (array([0, 0, 1, 1]), array(['L', 'D'], dtype=object)), so "D" gets the code 1, the same code that "W" received in the first dataframe.

I need some way to declare up front, when I create the dataframes, that there are 3 different levels, and to map each level to the same numeric value every time. How can I achieve this?

There is 1 answer

Answer by Marius (best answer)

Something like this is definitely possible using a Categorical:

outcome_cat = pd.Categorical(
    first_df['outcome'], 
    categories=['L', 'W', 'D'], ordered=False
)

The semantics of a Categorical are not exactly the same as the output of pd.factorize(), but its codes attribute holds your data as integer codes, and the Categorical is also aware of the unobserved 'D' category:

outcome_cat.codes
Out[6]: array([0, 0, 1, 1], dtype=int8)
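
To apply the same mapping to both dataframes from the question, one option is to declare the levels once and reuse them. The following is only a sketch, assuming a pandas version that provides pd.CategoricalDtype; the names outcome_dtype, first_codes, second_codes, combined and unknown_codes are made up for illustration and do not come from the original answer:

```python
import pandas as pd

# Declare all three levels once. Their order fixes the codes: L=0, W=1, D=2.
# (outcome_dtype is an illustrative name, not part of the original answer.)
outcome_dtype = pd.CategoricalDtype(categories=["L", "W", "D"], ordered=False)

first_df = pd.DataFrame({"outcome": ["L", "L", "W", "W"]})
second_df = pd.DataFrame({"outcome": ["L", "L", "D", "D"]})

# Convert both columns to the shared dtype so they agree on the mapping.
first_df["outcome"] = first_df["outcome"].astype(outcome_dtype)
second_df["outcome"] = second_df["outcome"].astype(outcome_dtype)

first_codes = first_df["outcome"].cat.codes.tolist()    # [0, 0, 1, 1]
second_codes = second_df["outcome"].cat.codes.tolist()  # [0, 0, 2, 2]

# Because both columns share identical categories, concatenation keeps the
# categorical dtype and the codes stay consistent across the combined data.
combined = pd.concat([first_df, second_df], ignore_index=True)
combined_codes = combined["outcome"].cat.codes.tolist()

# A value outside the declared levels becomes missing, with code -1.
unknown_codes = pd.Series(["L", "X"]).astype(outcome_dtype).cat.codes.tolist()
```

With this setup 'D' is coded as 2 in second_df rather than 1, so it no longer collides with 'W'. Note that anything not listed in categories is silently turned into a missing value (code -1), so the list of levels needs to be complete.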