I have a job in AWS Glue that fails with:
An error occurred while calling o567.pyWriteDynamicFrame. Job aborted due to stage failure: Task 168 in stage 31.0 failed 4 times, most recent failure: Lost task 168.3 in stage 31.0 (TID 39474, ip-10-0-32-245.ec2.internal, executor 102): ExecutorLostFailure (executor 102 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The main message is "Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used."
I have used broadcast joins for the small DataFrames and the salting technique for the bigger tables.
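Roughly what I did, as a sketch (the DataFrame and key names are placeholders for my real tables):

from pyspark.sql import functions as F

# Broadcast join for the small side:
joined = big_df.join(F.broadcast(small_df), "join_key")

# Salting for a skewed join between two large tables: add a random salt on
# one side, explode every salt value on the other, then join on (key, salt).
NUM_SALTS = 16
salted_left = big_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
salted_right = other_big_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)])))
joined_big = salted_left.join(salted_right, ["join_key", "salt"])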
The input consists of 75 GB of JSON files to process.
I have used a grouping size of 32 MB for the input files:
additional_options={
    'groupFiles': 'inPartition',
    'groupSize': 1024*1024*32,
},
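For completeness, these grouping options are passed when reading the input (the database and table names here are placeholders for the real ones):

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_json_table",    # placeholder
    additional_options={
        'groupFiles': 'inPartition',
        'groupSize': 1024*1024*32,
    },
)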
The output is written with 256 partitions:
output_df = output_df.coalesce(256)
In AWS Glue I launch the job with 60 G.2X workers = 60 x (8 vCPU, 32 GB of memory, 128 GB of disk).
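The error itself suggests boosting spark.yarn.executor.memoryOverhead; in Glue a Spark setting like that is usually passed through the job parameters, although the --conf key is not officially supported. A sketch with an illustrative value:

# Set under Job details -> Job parameters (or DefaultArguments via the API):
default_arguments = {
    "--conf": "spark.yarn.executor.memoryOverhead=4096",  # MB; illustrative value
}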
Below is a plot of the metrics for this job. Based on that, the data doesn't look skewed... Am I wrong?
Any advice to successfully run this is welcome!
Try to use repartition instead of coalesce. coalesce avoids a shuffle, so Spark collapses it into the preceding stage: the whole input ends up being processed with only the 256 partitions you asked for, and when those 256 tasks can't handle the input data volume you get this error. repartition inserts a full shuffle, so the upstream stages keep their own parallelism and only the final write uses 256 partitions.
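A minimal sketch of the change, using output_df from the question:

# repartition forces a full shuffle, so the upstream stages keep their own
# parallelism and only the final write runs with 256 tasks:
output_df = output_df.repartition(256)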