I am trying to understand how to set up a sparse pandas matrix to minimize memory usage and retain precision of all values. I did not find the answers in the pandas Sparse documentation. Below is an example which illustrates my questions:
1. Why does a Sparse(int32) dataframe take as much memory as a Sparse(float32) dataframe? Is there any advantage in specifying a Sparse(int) dtype if this is the case?
2. How does pandas decide which specific Sparse(int) dtype to use, e.g. int8 or int32? Given the example below (please see dataframes sdf_int32 and sdf_high_int32), it appears Sparse(int32) is always chosen, regardless of whether Sparse(int8) might be more memory-efficient or Sparse(int32) might truncate some values.
3. Is the only way to avoid truncation and achieve minimum memory usage to specify a Sparse(intNN) or Sparse(floatNN) dtype for each column?
import numpy as np
import pandas as pd
# Generate binary dense matrix with low density
df = pd.DataFrame()
for col in ['col1', 'col2', 'col3']:
    df[col] = np.where(np.random.random_sample(100_000_000) > 0.98, 1, 0)
df.name = 'Dense'
# Replace one column by values too high for int32 dtype
df_high = df.copy()
df_high['col1'] = df_high['col1'] * 100_000_000_000
# Convert df to sparse of various dtypes
sdf_float32 = df.astype(pd.SparseDtype('float32', 0))
sdf_float32.name = 'Sparse, float32'
sdf_int8 = df.astype(pd.SparseDtype('int8', 0))
sdf_int8.name = 'Sparse, int8'
sdf_int32 = df.astype(pd.SparseDtype('int', 0))
sdf_int32.name = 'Sparse, int32'
sdf_int64 = df.astype(pd.SparseDtype('int64', 0))
sdf_int64.name = 'Sparse, int64'
# Convert df_high to Sparse(int)
sdf_high_int32 = df_high.astype(pd.SparseDtype('int', 0))
sdf_high_int32.dtypes
sdf_high_int32['col1'].value_counts()
sdf_high_int32.name = 'Sparse, int32 highval'
# Print info for all dataframes
print(f" {df.name} Dataframe; Memory size: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB, {df['col1'].dtype}")
for data in [sdf_float32, sdf_int8, sdf_int32, sdf_high_int32, sdf_int64]:
    print(f" {data.name} Dataframe; Memory size: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB,"
          f"Density {data.sparse.density:.5%}, {data['col1'].dtype}")
"""
Dense Dataframe; Memory size: 1144.4 MB, int32
Sparse, float32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[float32, 0]
Sparse, int8 Dataframe; Memory size: 28.6 MB,Density 1.99980%, Sparse[int8, 0]
Sparse, int32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
Sparse, int32 highval Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
Sparse, int64 Dataframe; Memory size: 68.7 MB,Density 1.99980%, Sparse[int64, 0]
"""
# Show truncated values for sdf_high_int32
print(f"Values for sdf_high_int32, col1: \n {sdf_high_int32['col1'].value_counts()}")
"""
Values for sdf_high_int32, col1:
col1
0 98001473
1215752192 1998527
Name: count, dtype: int64
"""
There are two questions in your question; the first is about sparse data structures. The pandas documentation describes sparse objects as "compressed": any data matching a chosen fill value is omitted and not actually stored. That means that only the value chosen not to be stored (0 in your case) is left out; the other values are stored using the datatype you have chosen. float32 and int32 both use 32 bits to represent a value, so they consume the same memory. The difference is which values they can store, and at what precision. The same holds true for int64 versus float64. Since you only stored 0s and 1s, in your case you can pick int8 as well for storing df.
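The sizes in your printout line up with this. For each column, pandas keeps one array holding the non-fill values (at the chosen dtype's itemsize) and one int32 array holding their positions, so every stored value costs itemsize + 4 bytes. A back-of-the-envelope check against the density from your output:

# Rough memory check, assuming the ~2% density reported above
n_stored = int(3 * 100_000_000 * 0.0199980)  # non-fill values across the 3 columns
for name, itemsize in [('int8', 1), ('float32', 4), ('int32', 4), ('int64', 8)]:
    print(f"{name}: {n_stored * (itemsize + 4) / 1024 ** 2:.1f} MB")
# int8: 28.6 MB; float32 and int32: 45.8 MB; int64: 68.7 MB -- matching your numbers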
Now, answering your int question. Your platform seems to interpret int as int32; on my platform, int is equivalent to int64. Numpy is responsible for this: per the numpy documentation, the default integer type corresponds to the platform's C long, which is 32 bits on Windows (with NumPy 1.x) and 64 bits on most other 64-bit platforms.
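You can check what int resolves to on your machine; pd.SparseDtype('int', 0) resolves through the same numpy default:

import numpy as np
import pandas as pd

print(np.dtype(int))             # e.g. int32 (Windows, NumPy 1.x) or int64 (64-bit Linux/macOS)
print(pd.SparseDtype('int', 0))  # prints Sparse[int32, 0] or Sparse[int64, 0] accordingly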
Because in your case int32 was chosen, you see the values 0 (obviously) and 1215752192. The latter is 100_000_000_000 stored in an int32: there was an overflow, and the stored value is 100_000_000_000 % (2**32) (run this in Python), which gives 1215752192. BTW, here are the relevant checks in a Python interpreter:
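>>> 100_000_000_000 % (2**32)
1215752192
>>> import numpy as np
>>> np.array([100_000_000_000]).astype(np.int32)  # the same wraparound the sparse cast performs
array([1215752192], dtype=int32)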