Why does pandas sum() give wrong answers for Sparse dataframe?

Question

94 views Asked by Dudelstein At 03 February 2024 at 09:52

In a sparse DataFrame, calling sum() on the whole DataFrame gives wrong results, while sum() applied to a single column or to a subset of the DataFrame works.

It looks like an overflow in sum() when it is applied to the whole DataFrame, since the result is given the dtype Sparse[int8, 0]. But why isn't that the case in the other two scenarios?

Note: Strangely, when run in an Anaconda terminal, each scenario gives the correct result, while in PyCharm I see the error.
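The overflow reading can be checked with plain NumPy, independently of pandas. The sketch below assumes the true column sum is 24954 (the value from the dense example) and shows that forcing it into int8 leaves 122, which matches the wrong result. The variable names are illustrative only.

```python
import numpy as np

true_sum = 24954  # value reported by the dense df.sum() in the example

# Casting an int64 value to int8 keeps only the low 8 bits.
wrapped = int(np.array([true_sum], dtype=np.int64).astype(np.int8)[0])

# Summing 24954 ones with an int8 accumulator wraps around the same way,
# while an int64 accumulator gives the full count.
ones = np.ones(true_sum, dtype=np.int8)
narrow = int(ones.sum(dtype=np.int8))
wide = int(ones.sum(dtype=np.int64))

print(true_sum % 256, wrapped, narrow, wide)  # 122 122 122 24954
```

This only supports the idea that the whole-frame sum accumulated in int8; it does not show which pandas code path chooses that dtype.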

>>> import numpy as np
>>> import pandas as pd

>>> # Generate standard and sparse DF with binary variable.
>>> # Use int8 to minimize memory usage.
>>> df = pd.DataFrame(np.random.randint(low=0, high=2, size=(50_000, 1)))
>>> sdf = df.astype(pd.SparseDtype(dtype='int8', fill_value=0))
>>> print(df.sum(axis=0))
0    24954
dtype: int64

>>> # Why does this give a wrong answer while the other two work?
>>> print(sdf.sum(axis=0))
0    122
dtype: Sparse[int8, 0]

>>> # Works
>>> print(sdf[0].sum())
24954

>>> # Works
>>> print(sdf[sdf==1].sum())
0    24954.0
dtype: float64

Finally, what is a safe way to sum the columns of a sparse DataFrame without going dense or changing the dtype? I currently iterate over the columns and save each sum() result in a dictionary (similar to the second scenario in this example), then convert the dictionary to a DataFrame, which seems a bit cumbersome.

**Corralien** · Accepted Answer · 2024-02-03T12:10:44+00:00

Unfortunately, I think there is probably no good answer to your question. I would rather use scipy if I had to deal with sparse matrices:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame(np.random.randint(low=0, high=2, size=(50_000, 3)))
sdf = csr_matrix(df, dtype='int8')

>>> sdf 
<50000x3 sparse matrix of type '<class 'numpy.int8'>'
    with 75298 stored elements in Compressed Sparse Row format>

>>> sdf.sum(axis=0)
matrix([[24963, 25202, 25133]])

>>> pd.DataFrame(sdf.sum(axis=0), columns=df.columns)
       0      1      2
0  24963  25202  25133

However, note the issue opened by a pandas team member about deprecating the sparse dtype: DEPR: SparseDtype #56518
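If the data already lives in a pandas sparse DataFrame, a possible middle ground is to hand it to scipy through the `DataFrame.sparse.to_coo()` accessor instead of building the matrix from the dense frame. This is only a sketch: it assumes scipy is installed, that every column is sparse with a fill value of 0, and it relies on scipy returning full counts for int8 input, as in the output above. The seed and variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(rng.integers(low=0, high=2, size=(50_000, 3)))
sdf = df.astype(pd.SparseDtype(dtype="int8", fill_value=0))

# Convert the sparse frame to a scipy COO matrix without densifying it.
coo = sdf.sparse.to_coo()

# Column sums; np.asarray(...).ravel() flattens the (1, n_columns) result.
col_sums = np.asarray(coo.sum(axis=0)).ravel()
result = pd.Series(col_sums, index=sdf.columns)

print(result)
```

The sums come back as an ordinary dense Series with one entry per column, so the sparse data itself is never expanded.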
