In a sparse dataframe, the sum() method applied to the whole dataframe gives wrong results, while sum() applied to a specific column or to a dataframe subset works.
It looks like an overflow issue for sum() when applied to the whole dataframe, since the type Sparse[int8, 0] is chosen for the sum result. However, why isn't that the case for the other two scenarios?
Note: Strangely, when run in the Anaconda terminal, each scenario gives the correct result, while in PyCharm I see the error.
>>> import numpy as np
>>> import pandas as pd
>>> # Generate standard and sparse DF with binary variable.
>>> # Use int8 to minimize memory usage.
>>> df = pd.DataFrame(np.random.randint(low=0, high=2, size=(50_000, 1)))
>>> sdf = df.astype(pd.SparseDtype(dtype='int8', fill_value=0))
>>> print(df.sum(axis=0))
0 24954
dtype: int64
>>> # Why does this give a wrong answer while the other two work?
>>> print(sdf.sum(axis=0))
0 122
dtype: Sparse[int8, 0]
>>> # Works
>>> print(sdf[0].sum())
24954
>>> # Works
>>> print(sdf[sdf==1].sum())
0 24954.0
dtype: float64
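The overflow hypothesis can be checked with plain arithmetic: int8 only holds values from -128 to 127, so a total of 24954 wraps around modulo 256, and 24954 mod 256 is 122, which matches the wrong result above. A minimal sketch using only NumPy casts (it illustrates the wraparound and does not call pandas' internal reduction code):

```python
import numpy as np

true_total = 24954  # correct column sum reported by df.sum(axis=0) above

# Casting to int8 keeps only the low 8 bits, i.e. the value wraps modulo 256.
wrapped = np.array([true_total], dtype=np.int64).astype(np.int8)[0]
print(int(wrapped))  # 122

# Accumulating that many ones in an int8 accumulator wraps the same way.
ones = np.ones(true_total, dtype=np.int8)
print(int(ones.sum(dtype=np.int8)))  # 122
```

This is consistent with the result being accumulated or stored in int8 when sum() is applied to the whole sparse dataframe.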
Finally, what's a safe way to sum the columns of a sparse dataframe without going dense or changing the dtype? I currently iterate over the columns, save each sum() result in a dictionary (as in Scenario 2 of this example), and then convert the dictionary to a dataframe, which seems a bit cumbersome.
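For reference, the per-column workaround described here can be written compactly as a dictionary comprehension fed straight into a Series. This is only a sketch of that approach; the three-column random data, the seed, and the name `col_sums` are illustrative assumptions, not part of the original example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df = pd.DataFrame(rng.integers(0, 2, size=(50_000, 3)))
sdf = df.astype(pd.SparseDtype(dtype="int8", fill_value=0))

# Sum each sparse column on its own (the scenario that works above)
# and collect the scalar results in a Series indexed by column label.
col_sums = pd.Series({col: sdf[col].sum() for col in sdf.columns})
print(col_sums)
```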
Unfortunately, I think there is probably no good answer to your question. I would rather use SciPy if I had to deal with sparse matrices.
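A minimal sketch of the SciPy route, assuming every column of the frame is sparse so that the `DataFrame.sparse.to_coo()` accessor applies. The three-column random data, the seed, and the variable names are illustrative only. SciPy accumulates integer sums in a wider integer type by default, so the int8 storage should not overflow here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df = pd.DataFrame(rng.integers(0, 2, size=(50_000, 3)))
sdf = df.astype(pd.SparseDtype(dtype="int8", fill_value=0))

# Convert to a SciPy COO matrix without densifying the data.
coo = sdf.sparse.to_coo()

# Column sums; the result is a 1 x n_columns matrix, so flatten it.
col_sums = np.asarray(coo.sum(axis=0)).ravel()

# Put the labels back if a pandas object is needed.
result = pd.Series(col_sums, index=sdf.columns)
print(result)
```

The data stays sparse throughout; only the small vector of column totals is dense.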
However, note the issue opened by a pandas member, "DEPR: SparseDtype" (#56518), which discusses deprecating SparseDtype.