Setting levels a priori when using factorize in Pandas to cover missing cases

Question

237 views Asked by tumultous_rooster At 17 November 2014 at 03:52

I understand how to use factorize to encode the levels of a factor, such as "L" and "W" (for losses and wins), as numeric values, such as 0 and 1:

import pandas as pd
first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
pd.factorize(first_df['outcome'])

The above returns (array([0, 0, 1, 1]), array(['L', 'W'], dtype=object)).

However, later on, I'd like to combine this result with some other results, where we now have a new outcome, a draw ("D"), and here is where things get sticky:

second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})
pd.factorize(second_df['outcome'])

This returns (array([0, 0, 1, 1]), array(['L', 'D'], dtype=object)), so "D" is assigned code 1, the same code that "W" received in the first DataFrame.

I need some way to preemptively declare the fact that there are 3 different levels when I create the dataframes, and map the correct numeric value to the correct level. How can I achieve this?

There is 1 answer

**Marius** · Accepted Answer · 2014-11-17T04:32:49+00:00

Something like this is definitely possible using a Categorical:

outcome_cat = pd.Categorical(
    first_df['outcome'], 
    categories=['L', 'W', 'D'], ordered=False
)

The semantics of a Categorical may not exactly match the output of pd.factorize(), but its codes attribute contains your data as numeric values; the Categorical is simply also aware of the unobserved 'D' value:

outcome_cat.codes
Out[6]: array([0, 0, 1, 1], dtype=int8)
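
To cover the second DataFrame from the question as well, the same list of categories can be reused so that every level keeps one fixed code across both frames. The following is a minimal sketch rather than part of the original answer: the names `outcome_dtype`, `encode`, and the `outcome_code` column are made up for illustration, and it assumes a pandas version that provides `pd.CategoricalDtype`.

```python
import pandas as pd

# Assumption: all levels are known up front; their order here fixes the codes
# (L -> 0, W -> 1, D -> 2).
outcome_dtype = pd.CategoricalDtype(categories=['L', 'W', 'D'], ordered=False)


def encode(df):
    """Return a copy of df with a hypothetical 'outcome_code' column added."""
    out = df.copy()
    # Values that are not in the declared categories would become NaN,
    # which .cat.codes reports as -1.
    out['outcome_code'] = out['outcome'].astype(outcome_dtype).cat.codes
    return out


first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})

# Encode each frame with the shared dtype, then stack the results.
combined = pd.concat([encode(first_df), encode(second_df)], ignore_index=True)
print(combined['outcome_code'].tolist())  # [0, 0, 1, 1, 0, 0, 2, 2]
```

Because both frames are encoded against the same declared categories, "D" maps to 2 in the second frame instead of colliding with the code used for "W" in the first.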
