I'm not sure about the concept of memory footprint. When loading a Parquet file of e.g. 1 GB and creating RDDs out of it in Spark, what would be the memory footprint for each RDD?
When you create an RDD out of a Parquet file, nothing is loaded or executed until you run an action (e.g., first, collect) on the RDD; Spark evaluates RDDs lazily.
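Here is a minimal sketch of that laziness, assuming a local Spark session and a hypothetical file at data/events.parquet:

```scala
import org.apache.spark.sql.SparkSession

object LazyRddDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-rdd-demo")
      .master("local[*]")
      .getOrCreate()

    // Nothing is read from disk here: this only builds a lineage graph.
    val rdd = spark.read.parquet("data/events.parquet").rdd

    // Still nothing loaded; transformations are lazy too.
    val mapped = rdd.map(row => row.mkString(","))

    // Only now does Spark actually scan the file, and only as much
    // of it as the action needs: first() reads a single partition.
    println(mapped.first())

    spark.stop()
  }
}
```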
Now, your memory footprint will most likely vary over time. Say you have 100 equally sized partitions of 10 MB each, and you are running on a cluster with 20 cores. Then at any point in time you only need

10 MB x 20 cores = 200 MB

of data in memory, because only 20 tasks can run concurrently. On top of this, Java objects tend to take more space than their on-disk representation (and Parquet is compressed and column-encoded on disk), so it's not easy to say exactly how much space your 1 GB file will take in the JVM heap, assuming you load the entire file. It could be 2x, or it could be more.
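A quick way to sanity-check those numbers is to look at how many partitions Spark actually created for your file. A sketch, reusing the session and hypothetical path from above:

```scala
val rdd = spark.read.parquet("data/events.parquet").rdd

// With ~1 GB of input split into, say, 100 partitions, each task
// holds roughly 10 MB of on-disk data (more once deserialized on the heap).
println(s"partitions: ${rdd.getNumPartitions}")
```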
One trick you can use to measure this is to force your RDD to be cached and fully materialized. You can then check the Storage tab in the Spark UI and see how much space that RDD took when cached.
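A sketch of that trick, again with the hypothetical path from above. Note that cache() is itself lazy, so you need a full action such as count() before the Storage tab shows anything; getRDDStorageInfo (a @DeveloperApi on SparkContext) exposes the same numbers programmatically:

```scala
val rdd = spark.read.parquet("data/events.parquet").rdd

// cache() only marks the RDD for caching; nothing is stored yet.
val cached = rdd.cache()

// Run a full action so every partition is computed and cached.
cached.count()

// The Spark UI's Storage tab now shows the in-memory size.
// Programmatically, via the developer API:
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}
```

Comparing the cached size against the 1 GB on disk gives you the actual expansion factor for your data, rather than guessing at the 2x figure.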