Random "Unable to locate credentials" in the child processes on AWS EC2


Bounty clarification

I already found a band-aid solution; see my answer.

However, I still do not understand:

  1. At what point is the script below throttled: when it calls IMDS, or when IMDS calls STS?
  2. Why did passing credentials from the parent process to the children not help? It seems that throttling happens not at the credentials request but at some later stage, something like "verify credentials passed through env variables".

The bounty will be awarded for answering either of these two questions; references to the docs or source code are appreciated.

Please do not suggest a rewrite in Python or any other language; that is out of scope for this question.

Original question

I have an AWS EC2 host with 96 vCPUs.

I execute the following script, which spawns async child processes up to a certain limit (passed as a script parameter). Each child process calls the AWS CLI and exits.

#!/bin/bash

function f() {
 p=$1
 echo "$(date) entering f($p)"
 aws sts get-caller-identity
 echo "$(date) leaving f($p)"
}

a=""
for i in {1..400}; do
  a="$a
  $i"
done

jobs_limit=$1

while read line; do
    jobs=$(jobs -r | wc -l)
    while [ $jobs -ge $jobs_limit ]; do
      sleep 0.1
      jobs=$(jobs -r | wc -l)
    done
    echo "$(f $line)" &
done < <(echo "$a")

wait

Problem: when there are more than 40 child processes, some of them randomly fail with the error: Unable to locate credentials. You can configure credentials by running "aws configure".

Here is the output when the script is executed with different limits:

sh-4.2$ /tmp/b.sh 40 2>&1 | egrep 'Unable to locate credentials' | wc -l
0

sh-4.2$ /tmp/b.sh 40 2>&1 | egrep 'Unable to locate credentials' | wc -l
0

sh-4.2$ /tmp/b.sh 44 2>&1 | egrep 'Unable to locate credentials' | wc -l
22

sh-4.2$ /tmp/b.sh 44 2>&1 | egrep 'Unable to locate credentials' | wc -l
7

sh-4.2$ /tmp/b.sh 48 2>&1 | egrep 'Unable to locate credentials' | wc -l
18

sh-4.2$ /tmp/b.sh 48 2>&1 | egrep 'Unable to locate credentials' | wc -l
14

As you can see above, there are zero errors with a limit of 40 and a random number of errors with any limit above 40.

Can someone explain what's going on?


I had an idea that the CLI may be calling the instance metadata endpoint to get credentials, and that the endpoint somehow limits the request rate. So I changed the script to fetch the credentials once in the parent process and pass them to each child, but no luck:

function f() {
  p=$1
  cc=$2
  echo "$(date) entering f($p)"
  k=$(echo $cc | jq .AccessKeyId -r)
  s=$(echo $cc | jq .SecretAccessKey -r)
  t=$(echo $cc | jq .Token -r)
  AWS_ACCESS_KEY_ID=$k AWS_SECRET_ACCESS_KEY=$s AWS_SESSION_TOKEN=$t aws sts get-caller-identity
  echo "$(date) leaving f($p)"
}

...

creds=$(curl http://169.254.169.254/latest/meta-data/iam/security-credentials/EMRJobFlowRole)

...

   echo "$(f $line $creds)" &

...

There is 1 answer

Alexander Pavlov

@jarmod pointed me in the right direction. If I reduce the rate at which I spawn children, the problem goes away.

See the extra sleep 0.05 at the end of the loop.

#!/bin/bash

function f() {
 p=$1
 echo "$(date) entering f($p)"
 aws sts get-caller-identity
 echo "$(date) leaving f($p)"
}

a=""
for i in {1..400}; do
  a="$a
  $i"
done

jobs_limit=$1

while read line; do
    jobs=$(jobs -r | wc -l)
    while [ $jobs -ge $jobs_limit ]; do
      sleep 0.050
      jobs=$(jobs -r | wc -l)
    done
    echo "$(f $line)" &

    # Otherwise IMDS throttles us
    sleep 0.05

done < <(echo "$a")

wait
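
As an alternative to a fixed sleep between spawns, each child could retry the CLI call when it fails. The following is only a minimal sketch, assuming the failure is transient; with_retry is a hypothetical helper name, and the attempt count and delay are placeholder values, not tuned ones.

```shell
#!/bin/bash

# Hypothetical helper: with_retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
# Runs the command until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, otherwise the exit status of the last attempt.
with_retry() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  local status=0
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    # Fixed pause between attempts (assumed value, adjust as needed)
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

Inside f, the call would then look like with_retry 5 0.2 aws sts get-caller-identity. The AWS CLI also documents the environment variables AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT, which control how many times and how long it tries the instance metadata service; raising them may be worth trying in this setup, though that is untested here.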

Still an open question: if this is an IMDS-related issue, why does it not help when I pass credentials to every child process? Why would the CLI call IMDS in that scenario?
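
One thing worth ruling out before blaming IMDS for the second case: in the question, the credentials are passed as f $line $creds, with $creds unquoted. Assuming the metadata response is pretty-printed JSON containing spaces and newlines, bash splits it into many words, so $2 inside f would be just the opening brace and jq would have nothing useful to parse. The sketch below demonstrates the splitting with a made-up JSON string; the values and the show_args helper are hypothetical. Whether the CLI then ignores the resulting empty variables and falls back to IMDS is an assumption, not something confirmed here.

```shell
#!/bin/bash

# Hypothetical stand-in for the metadata response: JSON with spaces and newlines.
creds='{
  "Code" : "Success",
  "AccessKeyId" : "EXAMPLEKEY",
  "SecretAccessKey" : "EXAMPLESECRET",
  "Token" : "EXAMPLETOKEN"
}'

# Prints how many arguments arrived and what the second one contains.
show_args() {
  echo "argc=$# second=[$2]"
}

show_args 1 $creds     # unquoted: the JSON is split on whitespace, $2 is only "{"
show_args 1 "$creds"   # quoted: the whole JSON arrives intact as $2
```

If this turns out to be the cause, quoting the expansions (f "$line" "$creds" at the call site, and echo "$cc" inside f) should be enough to get the real keys into the environment of each child.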