Spark write succeeds (despite memory errors) in spark-shell but fails with spark-submit


I am trying to read 70 GB of data, apply a filter, and write the output to another S3 location (I am adding a coalesce(1000) before the write, though). A rough sketch of what the job does is below.
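Roughly, the job looks like this (a minimal sketch; the paths, the input format, and the filter predicate are placeholders, not the real ones):

import org.apache.spark.sql.SparkSession

// Minimal sketch of the job; paths, format, and the filter column are hypothetical.
val spark = SparkSession.builder().appName("filter-and-write").getOrCreate()
import spark.implicits._

spark.read
  .parquet("s3://source-bucket/input/")        // ~70 GB of input
  .filter($"event_date" === "2021-01-01")      // hypothetical filter
  .coalesce(1000)                              // cap the number of output files
  .write
  .mode("overwrite")
  .parquet("s3://target-bucket/output/")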

This simple operation, when run with spark-submit, gives the error below:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 5782, ip-10-70-21-40.ap-south-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  17.5 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.


But when the same job is run in spark-shell, it completes and generates the data (including the _SUCCESS file), while continuously logging similar memory errors.


Submit config:

spark-submit   --deploy-mode cluster  --driver-memory 8g  --executor-memory 26g  --conf spark.executor.cores=4  --conf spark.executor.instances=10 --conf spark.sql.shuffle.partitions=400 --conf spark.default.parallelism=400 --conf spark.hadoop.fs.s3a.multipart.threshold=2097152000  --conf spark.hadoop.fs.s3a.multipart.size=104857600 --conf spark.hadoop.fs.s3a.maxRetries=4  --conf spark.hadoop.fs.s3a.connection.maximum=500 --conf spark.hadoop.fs.s3a.connection.timeout=600000 --conf spark.executor.memoryOverhead=3g  --conf spark.sql.caseSensitive=true --conf spark.task.maxFailures=4 --conf spark.network.timeout=600s --conf spark.sql.files.maxPartitionBytes=67108864   --conf spark.yarn.maxAppAttempts=1
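For reference, here is my understanding of the per-executor memory request under this config (a rough sketch; I am assuming YARN sizes the container as executor memory plus overhead, and that the default overhead would otherwise be max(384 MB, 0.10 * executor memory)):

// Rough per-executor container arithmetic for the submit config above
// (assumptions noted in the paragraph before this sketch).
val executorMemoryGb   = 26.0
val explicitOverheadGb = 3.0                                        // spark.executor.memoryOverhead=3g
val defaultOverheadGb  = math.max(0.384, 0.10 * executorMemoryGb)   // ~2.6 GB if overhead were left unset
val containerRequestGb = executorMemoryGb + explicitOverheadGb      // ~29 GB requested from YARN per executor

which is part of why the "17.5 GB of 16 GB physical memory used" limit in the error confuses me.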

Shell config:

spark-shell --master yarn --executor-memory 24g  --executor-cores 3 --driver-memory 8g --name shell --jars s3://xyz --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
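In case it is relevant, this is how I compare the effective settings of the two runs from inside the shell (a sanity check only; the sc handle is the one spark-shell provides):

// Print the effective memory/core settings of the running shell session.
sc.getConf.getAll
  .filter { case (k, _) => k.contains("memory") || k.contains("cores") }
  .foreach { case (k, v) => println(s"$k = $v") }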

I tried increasing spark.executor.memoryOverhead, but even with absurdly large values I am still getting similar errors.

Can someone help me understand why this is happening?
