When using the float64[pyarrow] dtype in Pandas 2.2.1, it appears that no additional memory is used when splitting a dataframe in two, and then joining it back together again.
When the regular float64 dtype is used, the same operations use about 3x the memory of the original dataframe (which is what I'd intuitively expect, if the data is copied at each step).
Can anyone explain why this happens? It's obviously a good thing, but this doesn't seem to be listed under the benefits of pyarrow, so I'd like to understand why it happens.
The code I'm running is:
import gc
import os

import numpy as np
import pandas as pd
import psutil


def log_memory_usage():
    gc.collect()
    pid = os.getpid()
    p = psutil.Process(pid)
    full_info = p.memory_info()
    print(f"Memory usage: {full_info.rss / 1024 / 1024:.2f} MB (RSS)")


log_memory_usage()
df1 = pd.DataFrame(
    np.ones(shape=(10000000, 10)),
    columns=[f"col_{i}" for i in range(10)],
    dtype="float64",
)
log_memory_usage()
split1 = df1.loc[:, df1.columns[:5]]
split2 = df1.loc[:, df1.columns[5:]]
log_memory_usage()
joined_again = pd.concat([split1, split2], axis=1)
log_memory_usage()
This prints:
Memory usage: 68.28 MB (RSS)
Memory usage: 831.38 MB (RSS)
Memory usage: 1594.66 MB (RSS)
Memory usage: 2346.45 MB (RSS)
So splitting and concatenating the dataframe each allocate a full extra copy of the data (roughly 760 MB per step).
But if I change dtype="float64" to dtype="float64[pyarrow]" I get the following output:
Memory usage: 68.14 MB (RSS)
Memory usage: 833.51 MB (RSS)
Memory usage: 833.84 MB (RSS)
Memory usage: 833.93 MB (RSS)
So it appears that very little additional memory is used for the split and joined versions of the dataframe.
This appears to be a copy-on-write optimization. If I add this to the code:
pd.set_option("mode.copy_on_write", True)
then both dtypes result in reduced memory usage.
So perhaps the pyarrow backend has copy-on-write enabled by default? I couldn't find this in the docs anywhere, though.
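For reference, here's a minimal sketch of what I mean by copy-on-write behaviour, using the documented `mode.copy_on_write` option from pandas 2.x: with it enabled, a column selection shares the parent's buffer until one side is actually written to.

```python
import numpy as np
import pandas as pd

# Enable Copy-on-Write explicitly (default behaviour from pandas 3.0).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame(np.ones((4, 2)), columns=["a", "b"])
subset = df[["a"]]

# No eager copy: the subset still shares the parent's data buffer.
assert np.shares_memory(df["a"].to_numpy(), subset["a"].to_numpy())

# Writing to the subset triggers the actual copy; the parent is untouched.
subset.loc[0, "a"] = 99.0
assert df.loc[0, "a"] == 1.0
```

So the numpy-backed dataframe can show the same "no extra memory" behaviour once CoW is switched on, which is what makes me suspect the pyarrow backend gets this for free.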