Polars and Pandas DataFrames consume almost the same memory. Where is the advantage of Polars?


I wanted to compare memory consumption for the same dataset. I ran the same SQL query against an Oracle DB with both pandas and polars. The memory usage is almost the same, and pandas executes about 2 times faster than polars. I expected polars to be more memory efficient.

Can anyone explain this? And are there any suggestions for reducing memory usage for the same dataset?

Polars Read SQL: enter image description here

Pandas Read SQL: enter image description here

result(polars) and data(pandas) shapes:

enter image description here

and lastly memory usages:

enter image description here


There are 2 answers

0
ignoring_gravity (best answer)

One of the big advantages of Polars is query optimisation.

If you're loading all the data into memory with read_database, and only doing that, then there will be no difference.

On the other hand, if you make the dataframe you read in lazy (DataFrame.lazy), then perform some other operations, and then collect the results (LazyFrame.collect), that's where you'll see Polars shine.

Note: usually you'll want to read the data in lazily directly (e.g. scan_parquet instead of read_parquet), but for read_database there is no scan_ equivalent.

3
Dean MacGregor

The polars memory efficiency claim isn't about data at rest. It's about the memory overhead of performing operations on the data.

Here's a good demo of that

Screenshots: enter image description here

You can see how, on the left, polars takes about 2 sec and has a small bump in memory usage (maybe ~5%). On the right, pandas takes about 40 sec and needs about 20% of system memory.

The reason polars is faster is two-fold. First, as you can see, on the polars side all the CPU threads go to 100%, but on the pandas side only 1 thread is busy at a time and even that isn't sustained. Second, the memory inefficiency of pandas means it copies data unnecessarily, which is slow.

So in this demo polars is about 20x faster, and its memory bump is about 25% of what pandas needs.

As an aside on the speed difference noted in the question: polars doesn't natively read databases. It uses the connectorx library, which speeds up loading by splitting the query into chunks and fetching those chunks in parallel. You can read about that here. If the database backend can't execute the query any faster, then doing it in chunks might make it slower than letting the query run as a single call. Additionally, as presented, the db query is run with polars first and then with pandas. Databases usually cache results, so if you run the same query twice in a row the second run will be faster regardless of which 3rd party library makes the query.
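One way to keep the caching effect out of a comparison like this is to discard a warm-up run and alternate which reader goes first. The sketch below only illustrates the timing harness: it uses the standard library's in-memory SQLite in place of the Oracle DB, and reader_a and reader_b are hypothetical stand-ins for the polars and pandas read calls.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql):
    """Run the query, fetch all rows, return (elapsed seconds, row count)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, len(rows)


# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, i * 0.5) for i in range(50_000)]
)

sql = "SELECT id, value FROM t WHERE value > 100"

# Hypothetical stand-ins for the polars and pandas readers.
readers = {"reader_a": time_query, "reader_b": time_query}
timings = {name: [] for name in readers}

time_query(conn, sql)  # warm-up run, result discarded

for round_no in range(6):
    # Alternate the order so neither reader always benefits from a warm cache.
    names = list(readers)
    order = names if round_no % 2 == 0 else names[::-1]
    for name in order:
        elapsed, n_rows = readers[name](conn, sql)
        timings[name].append(elapsed)

medians = {name: statistics.median(vals) for name, vals in timings.items()}
print(medians)
```

Comparing medians over several alternating rounds is less sensitive to a single cached run than timing each library once in a fixed order.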