Removing categories with patsy and statsmodels


I am using statsmodels and patsy to build a logistic regression model. I'll use pseudocode here. Suppose I have a DataFrame containing a categorical variable, say Country, with 200 levels. I have reason to believe some of them are predictive, so I build a model like this:

formula = 'outcome ~  C(Country)'

patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy?

formula = 'outcome ~ C(Country) - C(Country)[GB]'

I tried it and it did not change anything.

Answer by Max Pierini

I don't know whether a patsy formula can drop a single level of a categorical variable, but you can do it in the DataFrame.

For example:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
    'outcome': np.random.random(size),
    'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')

print(df.Country)

0     ES
1     IT
2     UK
3     US
4     UK
      ..
95    FR
96    UK
97    ES
98    UK
99    US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']

Suppose we want to remove the category "US":

# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)

0     ES
1     IT
2     UK
4     UK
5     ES
      ..
94    UK
95    FR
96    UK
97    ES
98    UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']

Even though no rows with category "US" remain in the DataFrame, the category is still registered in the column's dtype. If we pass this DataFrame to a statsmodels formula, the design matrix gets an all-zero column for "US" and fitting fails with a singular matrix error, so we need to remove the unused categories:
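To see why the stale category matters, one can inspect the design matrix that patsy builds from the filtered data. The following is a minimal sketch, assuming patsy is importable directly and using a fresh random sample rather than the exact data above; the variable names `X` and `rank` are only for illustration.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data in the spirit of the example above (random, illustration only)
rng = np.random.RandomState(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({'Country': rng.choice(countries, 100)})
df['Country'] = df['Country'].astype('category')

# Drop the 'US' rows but leave the category list untouched
_df = df[df['Country'] != 'US'].copy()

# Build the design matrix for the right-hand side 'Country'
X = dmatrix('Country', data=_df, return_type='dataframe')
print(X.columns.tolist())

# The dummy column for the unused level contains only zeros
print(X['Country[T.US]'].sum())

# An all-zero column makes the matrix rank-deficient
rank = np.linalg.matrix_rank(X.values)
print(rank, X.shape[1])
```

The rank comes out lower than the number of columns, which is what triggers the singular matrix error during fitting.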

# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)

0     ES
1     IT
2     UK
4     UK
5     ES
      ..
94    UK
95    FR
96    UK
97    ES
98    UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
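The filter and the category cleanup can be wrapped in one small helper if several levels need to go. This is only a sketch; `drop_levels` is a hypothetical name, not a pandas or statsmodels function, and the tiny `demo` frame is made up for illustration.

```python
import pandas as pd

def drop_levels(frame, column, levels):
    """Return a copy of `frame` without the rows whose `column` value is in
    `levels`, with all unused categories removed from `column`."""
    out = frame[~frame[column].isin(levels)].copy()
    out[column] = out[column].cat.remove_unused_categories()
    return out

# Made-up data for illustration
demo = pd.DataFrame({
    'outcome': [0, 1, 0, 1, 1, 0],
    'Country': pd.Categorical(['IT', 'US', 'UK', 'US', 'FR', 'IT']),
})

trimmed = drop_levels(demo, 'Country', ['US'])
print(len(trimmed))
print(list(trimmed['Country'].cat.categories))
```

Note that `remove_unused_categories()` drops every category with no remaining rows, not only the ones listed in `levels`.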

Now we can fit a model:

mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())

Optimization terminated successfully.
         Current function value: 0.684054
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                outcome   No. Observations:                   83
Model:                          Logit   Df Residuals:                       79
Method:                           MLE   Df Model:                            3
Date:                Sun, 16 May 2021   Pseudo R-squ.:                 0.01179
Time:                        22:43:37   Log-Likelihood:                -56.776
converged:                       True   LL-Null:                       -57.454
Covariance Type:            nonrobust   LLR p-value:                    0.7160
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.1493      0.438     -0.341      0.733      -1.007       0.708
Country[T.FR]     0.4129      0.614      0.673      0.501      -0.790       1.616
Country[T.IT]    -0.1223      0.607     -0.201      0.840      -1.312       1.068
Country[T.UK]     0.1027      0.653      0.157      0.875      -1.178       1.383
=================================================================================