I am facing a java.lang.OutOfMemoryError: Java heap space error every second time I run the same Spark program.
Here is a scenario:
When I do the spark-submit and run the Spark program for the first time, it gives me the correct output and everything is fine. When I execute the same spark-submit a second time, it throws a java.lang.OutOfMemoryError: Java heap space exception.
When does it work again?
If I run the same spark-submit after clearing the Linux page cache (by writing to /proc/sys/vm/drop_caches, e.g. echo 3 > /proc/sys/vm/drop_caches as root), it runs successfully again, but only once.
I have tried setting all the relevant Spark configs, such as memoryOverhead, driver-memory, executor-memory, etc.
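For reference, those map to properties like spark.executor.memoryOverhead, spark.driver.memory, and spark.executor.memory. A rough sketch of one way they can be set from code is below; the values are placeholders, not my actual settings, and driver memory usually has to be passed to spark-submit itself rather than set here:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values for illustration only.
// spark.driver.memory generally only takes effect when passed to
// spark-submit (e.g. --driver-memory 4g), not when set in client mode.
val spark = SparkSession.builder()
  .appName("MemoryTuningExample")                 // hypothetical app name
  .config("spark.executor.memory", "4g")          // per-executor heap
  .config("spark.executor.memoryOverhead", "1g")  // off-heap overhead per executor
  .config("spark.driver.memory", "4g")            // normally set via spark-submit instead
  .getOrCreate()
```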
Any idea what is happening here? Is this really a problem with the Spark code, or is it happening because of some Linux machine setting or the way the cluster is configured?
Thanks.
If you are using `df.persist()` or `df.cache()`, then you should also be calling the `df.unpersist()` method when you are done with the data; there is also `sqlContext.clearCache()`, which clears everything that has been cached.
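A minimal sketch of that lifecycle, assuming a DataFrame `df` read from a hypothetical path and filtered on a hypothetical column name:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object CacheLifecycleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CacheLifecycleExample").getOrCreate()

    // Hypothetical input path and column name, for illustration only.
    val df = spark.read.parquet("/path/to/input")

    // Cache the DataFrame while it is reused across several actions.
    df.persist(StorageLevel.MEMORY_AND_DISK)   // or simply df.cache()

    df.count()                                 // first action materialises the cache
    df.filter(col("value") > 0).count()        // later actions reuse the cached blocks

    // Release the cached blocks once they are no longer needed,
    // so they do not stay pinned in executor memory across jobs.
    df.unpersist()

    // Or drop everything cached in this session in one go.
    spark.sqlContext.clearCache()

    spark.stop()
  }
}
```

Calling `unpersist()` (or `clearCache()`) at the end of each run keeps cached blocks from accumulating between successive submissions.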