Pandas with pyarrow does not use additional memory when splitting dataframe


When using the float64[pyarrow] dtype in Pandas 2.2.1, it appears that no additional memory is used when splitting a dataframe in two, and then joining it back together again.

With the regular float64 dtype, the same operations end up using roughly 3x the memory of the original dataframe (which is what I'd intuitively expect if the data is copied at each step).

Can anyone explain why this happens? It's obviously a good thing, but this doesn't seem to be listed under the benefits of pyarrow, so I'd like to understand why it happens.

The code I'm running is:

import gc
import os

import numpy as np
import pandas as pd
import psutil


def log_memory_usage():
    gc.collect()
    pid = os.getpid()
    p = psutil.Process(pid)
    full_info = p.memory_info()
    print(
        f"Memory usage: {full_info.rss / 1024 / 1024:.2f} MB (RSS)"
    )


log_memory_usage()
df1 = pd.DataFrame(np.ones(shape=(10000000, 10)), columns=[f"col_{i}" for i in range(10)], dtype="float64")
log_memory_usage()
split1 = df1.loc[:, df1.columns[:5]]
split2 = df1.loc[:, df1.columns[5:]]
log_memory_usage()
joined_again = pd.concat([split1, split2], axis=1)
log_memory_usage()

This prints:

Memory usage: 68.28 MB (RSS)
Memory usage: 831.38 MB (RSS)
Memory usage: 1594.66 MB (RSS)
Memory usage: 2346.45 MB (RSS)

So splitting and concatenating the dataframe uses additional memory each time.

But if I change dtype="float64" to dtype="float64[pyarrow]" I get the following output:

Memory usage: 68.14 MB (RSS)
Memory usage: 833.51 MB (RSS)
Memory usage: 833.84 MB (RSS)
Memory usage: 833.93 MB (RSS)

So it appears that very little additional memory is used for the split and joined versions of the dataframe.

1 Answer

Answer by Neil:

This appears to be a copy-on-write optimization. If I add this to the code:

pd.set_option("mode.copy_on_write", True)

With that option enabled, the regular float64 dtype also stops allocating extra memory for the split and rejoined frames.

So perhaps pyarrow has copy-on-write enabled by default? I couldn't find this in the docs anywhere though.
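The copy-on-write behaviour described above can be observed directly. This is a sketch (pandas >= 2.0): under CoW, selecting a column returns a lazy copy that shares its buffer with the parent frame, and the copy is only materialized on the first write, which `np.shares_memory` can detect on the underlying NumPy buffers.

```python
import numpy as np
import pandas as pd

# Enable copy-on-write explicitly (the default from pandas 3.0).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame(np.ones((1000, 4)), columns=list("abcd"))
s = df["a"]  # column selection: a lazy copy, no data is duplicated yet

shares_before = np.shares_memory(df["a"].to_numpy(), s.to_numpy())

# The first write to `s` triggers the deferred copy, so the parent
# frame is left untouched and the two buffers are decoupled.
s.iloc[0] = 2.0
shares_after = np.shares_memory(df["a"].to_numpy(), s.to_numpy())

print(shares_before, shares_after)
```

Before the write the selection shares memory with the parent; after the write it has its own buffer, and `df` still holds its original values.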