Understanding AWS EMR Behavior with Spot Instances and Failed Queries Despite 'Completed' Status


I run KPI aggregation queries every morning on AWS EMR. Because I use Spot Instances, the servers are sometimes forcibly terminated when capacity runs short. In those cases the EMR cluster status shows "Canceled," and I deal with it by re-running the cluster. Each instance's status reads "Spot instance was terminated due to not enough capacity in the Spot instance pool," and no additional instances are added.

Recently, I noticed unusual behavior with EMR.
The EMR cluster status shows "Completed," but the logs indicate that some queries failed. I normally configure the cluster with one master node and three core (slave) nodes. This time, the initial three core nodes were forcibly terminated with the message "Spot instance was terminated due to not enough capacity in the Spot instance pool." However, three additional instances of the same type were then launched, so six instances were used in total (excluding the master node).
My guess is that after the initial three core nodes were terminated, three new instances were allocated, but the in-progress work was not handed over to them successfully, so queries failed even though the status showed "Completed."
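One way to test this guess is to look at the state of each step rather than the cluster as a whole. The following is a minimal sketch, assuming the queries run as EMR steps and that boto3 is available. The helper names and the sample data are hypothetical; `list_steps` is the boto3 EMR client call for the ListSteps API.

```python
def steps_not_completed(list_steps_response):
    """Return (name, state) for every step whose state is not COMPLETED."""
    return [
        (step["Name"], step["Status"]["State"])
        for step in list_steps_response.get("Steps", [])
        if step["Status"]["State"] != "COMPLETED"
    ]


def fetch_steps(cluster_id):
    """Call ListSteps for one cluster (needs boto3 and AWS credentials).

    Only the first page of results is returned here; a cluster with many
    steps would need pagination.
    """
    import boto3  # imported lazily so the helper above also works offline

    emr = boto3.client("emr")
    return emr.list_steps(ClusterId=cluster_id)


# Hypothetical sample shaped like a ListSteps response (not real data).
sample_steps = {
    "Steps": [
        {"Id": "s-1", "Name": "kpi-part-1", "Status": {"State": "COMPLETED"}},
        {"Id": "s-2", "Name": "kpi-part-2", "Status": {"State": "FAILED"}},
    ]
}

problem_steps = steps_not_completed(sample_steps)
```

Against a real cluster this would be called as `steps_not_completed(fetch_steps(cluster_id))`. If every step reports COMPLETED while the logs still show failed queries, the script run by the step may be swallowing query errors, which would be worth ruling out separately.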

My question is whether the launch of these three additional core nodes is related to the Auto Scaling settings in the EMR cluster configuration. Also, is there a setting that prevents instances from being added automatically?
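To narrow this down, it may help to check whether any instance group actually has an automatic scaling policy attached. This is a rough sketch, assuming the cluster uses instance groups and that an attached policy appears under the `AutoScalingPolicy` key of each group in the ListInstanceGroups response. The helper names and the sample data are hypothetical.

```python
def summarize_instance_groups(list_instance_groups_response):
    """Summarize each instance group and flag attached scaling policies."""
    rows = []
    for group in list_instance_groups_response.get("InstanceGroups", []):
        rows.append(
            {
                "id": group["Id"],
                "type": group["InstanceGroupType"],
                "market": group.get("Market"),
                "requested": group.get("RequestedInstanceCount"),
                # Treated as "no policy" when the key is missing or empty.
                "has_autoscaling_policy": bool(group.get("AutoScalingPolicy")),
            }
        )
    return rows


def fetch_instance_groups(cluster_id):
    """Call ListInstanceGroups (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above also works offline

    emr = boto3.client("emr")
    return emr.list_instance_groups(ClusterId=cluster_id)


# Hypothetical sample shaped like a ListInstanceGroups response (not real data).
sample_groups = {
    "InstanceGroups": [
        {"Id": "ig-MASTER", "InstanceGroupType": "MASTER",
         "Market": "ON_DEMAND", "RequestedInstanceCount": 1},
        {"Id": "ig-CORE", "InstanceGroupType": "CORE",
         "Market": "SPOT", "RequestedInstanceCount": 3},
    ]
}

group_summary = summarize_instance_groups(sample_groups)
```

If a group does report a policy, the boto3 EMR client has `remove_auto_scaling_policy(ClusterId=..., InstanceGroupId=...)` to detach it. If no group reports one, the extra instances presumably came from something other than scaling rules, and this check alone would not explain them.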

Regarding EMR Cluster Configuration:

・ServiceRole is set to EMR_DefaultRole.
・JobFlowRole is set to EMR_EC2_DefaultRole.
・AutoScalingRole is set to EMR_AutoScaling_DefaultRole.
・The master node is set to one "m5.xlarge," and the core nodes to three "c5.24xlarge."
・Versions: EMR 5.28.0, Hadoop 2.8.5, Hive 2.3.6, Presto 0.227
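Since the actual configuration file cannot be shared, here is only an illustrative sketch of how the settings above might map onto a boto3 `run_job_flow` request. The cluster name, the group names, and the choice of which groups use the Spot market are assumptions, not the real configuration. Note that neither instance group carries an `AutoScalingPolicy` entry, so having `AutoScalingRole` set does not by itself attach any scaling rules.

```python
def build_job_flow_request(core_market="SPOT", master_market="ON_DEMAND"):
    """Build keyword arguments for the boto3 EMR run_job_flow call.

    The market values are assumptions; adjust them to the real setup.
    """
    return {
        "Name": "kpi-aggregation",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.28.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Presto"}],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",  # hypothetical group name
                    "InstanceRole": "MASTER",
                    "Market": master_market,
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",  # hypothetical group name
                    "InstanceRole": "CORE",
                    "Market": core_market,
                    "InstanceType": "c5.24xlarge",
                    "InstanceCount": 3,
                    # No "AutoScalingPolicy" key: no scaling rules attached.
                },
            ],
            # Let the cluster shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }


job_flow_request = build_job_flow_request()
```

With credentials configured, the request could be submitted as `boto3.client("emr").run_job_flow(**job_flow_request)`. Building the dictionary separately makes it easy to confirm what is and is not being requested before anything is launched.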

Note:

・The query content is not the issue.
・Server logs are recorded but cannot be shared.
・The EMR cluster configuration file is not available for sharing.
・I have checked the instance usage with "Spot Instance Advisor" and found no issues.

This is my first post, and I apologize for any inconvenience in my writing. I would greatly appreciate any insights or advice. Thank you in advance.
