Polars and Pandas DataFrame consume almost same memory. Where is the advantage of Polars?

Question

1.4k views Asked by Atacan At 02 November 2023 at 13:23

I wanted to compare memory consumption for the same dataset. I ran the same SQL query against an Oracle DB with pandas and with Polars. The memory usage results are almost the same, and pandas executed about 2 times faster than Polars. I expected Polars to be more memory efficient.

Can anyone explain this? And are there any suggestions for reducing memory usage for the same dataset?

Polars Read SQL:

Pandas Read SQL:

result(polars) and data(pandas) shapes:

and lastly memory usages:

There are 2 answers

Dean MacGregor On 02 November 2023 at 15:26

The polars memory efficiency claim isn't about data at rest. It's about the memory overhead of performing operations on the data.

Here's a good demo of that

Screenshots:

You can see how, on the left, Polars takes about 2 seconds and shows a small bump in memory usage (maybe ~5%). On the right, pandas takes about 40 seconds and needs about 20% of system memory. Polars is faster for two reasons. First, on the Polars side all the CPU threads go to 100%, while on the pandas side only one thread is busy at a time, and not even for sustained periods. Second, memory inefficiency means pandas copies data unnecessarily, which is slow.

In that demo, Polars is about 20x faster and uses about 25% of the memory that pandas does.

As an aside on the speed difference noted in the question: Polars doesn't natively read databases. It uses the connectorx library, which tries to load data faster by splitting the query into chunks and fetching those chunks in parallel. You can read about that here. If the database backend can't run the query any faster, then doing it in chunks might be slower than letting the query happen as a single call. Additionally, as presented, the DB query is run with Polars first and pandas second. Databases usually cache results, so if you run the same query twice in a row, the second run will be faster regardless of which third-party library makes the query.

**ignoring_gravity** · Accepted Answer · 2023-11-02T13:39:43+00:00

One of the big advantages of Polars is query optimisation.

If you're loading all the data into memory with read_database, and only doing that, then there will be no difference.

On the other hand, if you make the dataframe you read in lazy (DataFrame.lazy), then perform some other operations, and then collect the results (LazyFrame.collect), that's where you'll see Polars shine.

Note: usually you'll want to read the data in lazily directly (e.g. scan_parquet instead of read_parquet), but read_database has no scan_ equivalent.
