Crosstab in pandas with enforced values

1.6k views Asked by At

I get some data samples, e.g,

df = pd.DataFrame({'A':[1,1,3,3,4],
                   'B':['Very Happy','Sad','Sad','Happy','Happy'],
                   'C': [True,False,False,True,False]})
>> df 


   A           B      C
0  1  Very Happy   True
1  1         Sad  False
2  3         Sad  False
3  3       Happy   True
4  4       Happy  False

And want to compute counts for each combination, so crosstab is the way to go

counts = pd.crosstab(index = df['A'], columns = [df['B'],df['C']])
>> counts


B Happy         Sad Very Happy
C False True  False      True 
A                             
1     0     0     1          1
3     0     1     1          0
4     1     0     0          0

However, the neither 'Very Sad' emotion nor id 2 happened to show up in this data sample, so it is not in the crosstab. I would like to have it as

  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0

My workaround is to set up a template

emotions = ['Very Happy', 'Happy', 'Sad', 'Very Sad']
ids = [1,2,3,4]
truths = [True,False]

template = pd.DataFrame(index = pd.Index(ids),
                    columns= pd.MultiIndex.from_product((emotions,truths)))
>> template
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
2        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
3        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
4        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN

And then fill it

template.unstack()[counts.unstack().index] = counts.unstack()
template = template.fillna(0)
>> template
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0

The problem is that it feels like there must be a cleaner, more readable way to achieve the same result. Any ideas?

1

There are 1 answers

1
behzad.nouri On BEST ANSWER

That is a pivot_table:

>>> pv = df.pivot_table(index='A',
...                     columns=['B', 'C'],
...                     aggfunc='size',
...                     fill_value=0)
>>> pv
B Happy         Sad Very Happy
C False True  False      True 
A                             
1     0     0     1          1
3     0     1     1          0
4     1     0     0          0

The columns/rows which do not appear there are becuase their cross section does not exist in the frame. You may add them by doing a .reindex:

>>> cols = pd.MultiIndex.from_product((['Very Happy', 'Happy', 'Sad', 'Very Sad'], [True, False]))
>>> pv.reindex(index=range(1, 5), columns=cols, fill_value=0)
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
A                                                        
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0