I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
output
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
The stack and group by sum are just the same.
However, the one I expect is
11 x 2
11 y 6
12 x 6
12 y 10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
output:
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
EDIT3: An issue has been logged https://github.com/pydata/pandas/issues/10417
With pandas 0.16.2 and Python 3, I was able to get the correct result via:
Which produces:
You can then change the index and column names to more desirable ones using
reindex()
andcolumns
.Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on
Series
, which is whatx.stack()
produces. My workaround is to turn theSeries
into aDataFrame
viareset_index()
. In this case theDataFrame
does not have aMultiIndex
anymore - I'm just grouping on labeled columns.To make sure that grouping and summing works on a
DataFrame
with aMultiIndex
, you can try this to get the same correct output:Either of these workarounds should take care of things until the bug is resolved.
I wonder if the bug has something to do with the
MultiIndex
instances that are created on aSeries
vs. aDataFrame
. For example:vs.
Notice how the
MultiIndex
on theDataFrame
describes the levels more correctly.