Bounty clarification
I already found a band-aid solution; see my answer.
However, I still do not understand:
- At what point is the script below throttled: when it calls IMDS, or when IMDS calls STS?
- Why did passing credentials from the parent process to the children not help? It seems the throttling happens not at the credentials request but at some later stage, something like "verify credentials passed through env variables" or whatever.
The bounty will be awarded for answering either of these two questions; references to the docs/source code are appreciated.
Please do not suggest a rewrite in Python or any other cool language; that is out of scope for this question.
Original question
I have an AWS EC2 host with 96 vCPUs.
I execute the following script, which spawns async child processes up to a certain limit (passed as a script parameter). Each async child process calls the AWS CLI and exits.
#!/bin/bash
function f() {
p=$1
echo "$(date) entering f($p)"
aws sts get-caller-identity
echo "$(date) leaving f($p)"
}
a=""
for i in {1..400}; do
a="$a
$i"
done
jobs_limit=$1
while read line; do
jobs=$(jobs -r | wc -l)
while [ $jobs -ge $jobs_limit ]; do
sleep 0.1
jobs=$(jobs -r | wc -l)
done
echo "$(f $line)" &
done < <(echo "$a")
wait
Problem: when there are more than 40 child processes, some of them randomly fail with the error: Unable to locate credentials. You can configure credentials by running "aws configure".
Here is the output when the script is executed with different limits:
sh-4.2$ /tmp/b.sh 40 2>&1 | egrep 'Unable to locate credentials' | wc -l
0
sh-4.2$ /tmp/b.sh 40 2>&1 | egrep 'Unable to locate credentials' | wc -l
0
sh-4.2$ /tmp/b.sh 44 2>&1 | egrep 'Unable to locate credentials' | wc -l
22
sh-4.2$ /tmp/b.sh 44 2>&1 | egrep 'Unable to locate credentials' | wc -l
7
sh-4.2$ /tmp/b.sh 48 2>&1 | egrep 'Unable to locate credentials' | wc -l
18
sh-4.2$ /tmp/b.sh 48 2>&1 | egrep 'Unable to locate credentials' | wc -l
14
As you can see above, there are zero errors with a limit of 40 and a random number of errors with any limit above 40.
Can someone explain what's going on?
I had an idea that the CLI may be calling the /metadata endpoint to get credentials and that the endpoint somehow limits the request rate, so I changed the script to fetch the credentials in the parent process and pass them to the children, but no luck:
function f() {
p=$1
cc=$2
echo "$(date) entering f($p)"
k=$(echo "$cc" | jq .AccessKeyId -r)
s=$(echo "$cc" | jq .SecretAccessKey -r)
t=$(echo "$cc" | jq .Token -r)
AWS_ACCESS_KEY_ID=$k AWS_SECRET_ACCESS_KEY=$s AWS_SESSION_TOKEN=$t aws sts get-caller-identity
echo "$(date) leaving f($p)"
}
...
creds=$(curl http://169.254.169.254/latest/meta-data/iam/security-credentials/EMRJobFlowRole)
...
echo "$(f "$line" "$creds")" &
...
@jarmod pointed me in the right direction. If I reduce the rate at which I spawn children, the problem goes away.
See the extra
sleep 0.050
at the end of the loop.
Still an open question: if this is an IMDS-related issue, why does it still fail when I pass credentials to every child process? Why does the CLI call IMDS in this scenario?
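For reference, the rate-limited loop can be sketched roughly as below. This is a minimal sketch, not the exact script: `work` is a hypothetical stand-in for the function that calls `aws sts get-caller-identity`, and the job count of 20, the default limit of 4 and the temp file are assumptions made only so the sketch runs anywhere.

```shell
#!/bin/bash
# Hypothetical stand-in for the real per-child work; the original script
# runs `aws sts get-caller-identity` at this point.
work() {
  echo "job $1 done"
}

jobs_limit=${1:-4}     # max concurrent children (assumed default of 4)
spawn_delay=0.050      # pause between spawns, as in the workaround
out=$(mktemp)          # collects one line per finished child

for i in {1..20}; do
  # Wait until fewer than jobs_limit children are running.
  while [ "$(jobs -r | wc -l)" -ge "$jobs_limit" ]; do
    sleep 0.1
  done
  work "$i" >> "$out" &
  # The extra sleep caps the spawn rate in addition to the concurrency cap.
  sleep "$spawn_delay"
done
wait

completed=$(wc -l < "$out")
echo "completed: $completed"
rm -f "$out"
```

The concurrency cap alone still lets a burst of children start at the same instant; the sleep after each spawn spreads those starts out over time.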