NetTimeout when using Ruby AWS SDK to invoke a synchronous Lambda in threads

This has me really stumped, and I hope someone might have some insight. We are using Ruby and the AWS SDK to invoke an AWS Lambda synchronously. The time it takes for the Lambda to complete is usually no more than 7 minutes.

The timeout defined in the AWS console for our Lambda is 600 seconds (10 minutes). The configuration we pass to the AWS Lambda client object is:

{
  http_read_timeout: 600,
  max_attempts: 1,
  retry_limit: 0
}

We are required to invoke this Lambda multiple times in threads. Each thread uses a different AWS Lambda client object (with the same configuration as above) and passes a different event payload on invocation. Our program waits for all the invoking threads to complete.
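For reference, the setup described above looks roughly like the following sketch. The helper name `invoke_in_threads` and the `client_factory` parameter are made up for illustration and are not from our real code; the default factory assumes the aws-sdk-lambda gem is installed and that region and credentials come from the environment.

```ruby
require 'json'

# Same client options as shown in the question.
CLIENT_OPTIONS = { http_read_timeout: 600, max_attempts: 1, retry_limit: 0 }.freeze

# Builds a real client. The SDK is required lazily so that a stub factory
# can be swapped in when experimenting without AWS access.
DEFAULT_CLIENT_FACTORY = lambda do
  require 'aws-sdk-lambda'
  Aws::Lambda::Client.new(**CLIENT_OPTIONS)
end

# Invokes the function once per event, each in its own thread with its own
# client, and returns the response payloads in the same order as the events.
def invoke_in_threads(function_name, events, client_factory: DEFAULT_CLIENT_FACTORY)
  threads = events.map do |event|
    Thread.new do
      client = client_factory.call
      response = client.invoke(
        function_name: function_name,
        invocation_type: 'RequestResponse', # synchronous invocation
        payload: JSON.generate(event)
      )
      response.payload.string # the payload is returned as a StringIO
    end
  end
  # Thread#value joins the thread and re-raises any exception it hit.
  threads.map(&:value)
end
```

A call would look like `invoke_in_threads('my-function', [{ 'id' => 1 }, { 'id' => 2 }])`, where `my-function` is a placeholder name.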

Locally, from our computers, this works very reliably. However, when our program runs within ECS we get NET::Timeout TCP socket errors. The Lambda is invoked X times and the code in the Lambda succeeds, but the AWS Lambda clients in the invoking threads reach the 600-second timeout without receiving the response from the Lambda and fail with the NET::Timeout TCP error.

We could change our design such that:

  • We invoke the Lambda asynchronously N times.
  • Each invocation then publishes its result to an AWS SNS topic.
  • An AWS SQS queue is subscribed to the topic, so the incoming messages go onto the queue.
  • Our program then polls the SQS queue, retrieving the messages and completing the feedback loop.
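The caller's side of that redesign could be sketched as below. This is only an illustration under assumptions: `invoke_async` and `collect_results` are hypothetical helper names, the clients are passed in (they would be `Aws::Lambda::Client` and `Aws::SQS::Client` instances), and the queue is assumed to receive the standard SNS JSON envelope, that is, raw message delivery is off.

```ruby
require 'json'

# Fires one asynchronous ('Event') invocation per event and returns at once.
def invoke_async(lambda_client, function_name, events)
  events.each do |event|
    lambda_client.invoke(
      function_name: function_name,
      invocation_type: 'Event',
      payload: JSON.generate(event)
    )
  end
end

# Long-polls the queue until expected_count results have arrived or
# max_polls is reached, deleting each message once it has been read.
def collect_results(sqs_client, queue_url, expected_count, max_polls: 100)
  results = []
  polls = 0
  while results.size < expected_count && polls < max_polls
    polls += 1
    response = sqs_client.receive_message(
      queue_url: queue_url,
      max_number_of_messages: 10,
      wait_time_seconds: 20
    )
    response.messages.each do |message|
      envelope = JSON.parse(message.body) # SNS wraps the published text
      results << envelope['Message']
      sqs_client.delete_message(queue_url: queue_url, receipt_handle: message.receipt_handle)
    end
  end
  results
end
```

With this shape no HTTP connection has to stay open for the full run time of the Lambda, which is the main attraction over the synchronous approach. A real version would also need to handle duplicate deliveries and match results to requests.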

But that is not a trivial redesign and refactor. It is possibly 6 days of dev/test work, which we don't have.

I would be very grateful to anyone who has any valuable insight into this problem. It would be good to have a dialogue and share some ideas.

Thank you kindly, guys!

  • Our approach and program work reliably locally from our computers.
  • The problem is reproduced when our program is run as a microservice in ECS.
  • I intend to test this locally in a Docker container and see what happens. I wonder if this is related to a limit on the sockets available within a Docker container?

There are 2 answers

rr3tt

"Locally, from our computers this works very reliably. However, when our program is run within ECS then we get NET::Timeout TCP socket errors. The Lambda will invoke X times."

If the Lambda function is still invoked from ECS but the call times out, perhaps something in the networking between them is cutting the connection: something in ECS, a load balancer, a proxy, or similar. That may be worth investigating.

It's tough to know without all the details, but in general I would probably go for an async design, as it will scale and handle failures better. Instead of calling the Lambda function async directly, your threads could push messages to an SQS queue that the Lambda function reads from.

user2622636

We think we cracked this. Hopefully this will help anyone who has a similar problem.

The problem was on the client side, specifically in the Alpine Docker container OS. We needed to:

a) Set tcp_keepalive_time = 300

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

b) Set tcp_syn_retries = 8

https://man7.org/linux/man-pages/man7/tcp.7.html
http://willbryant.net/overriding_the_default_linux_kernel_20_second_tcp_socket_connect_timeout
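To check what a container is actually running with, something like the following sketch can read both values from /proc. The helper names are made up for this example, and it only reports the values; changing them is done with sysctl (for example `sysctl -w net.ipv4.tcp_keepalive_time=300`) or through whatever mechanism your container platform offers. For comparison, the stock Linux default for tcp_keepalive_time is 7200 seconds.

```ruby
# Values from the answer above.
EXPECTED_TCP_SETTINGS = { 'tcp_keepalive_time' => 300, 'tcp_syn_retries' => 8 }.freeze
SYSCTL_ROOT = '/proc/sys/net/ipv4'

# Reads one kernel TCP setting; returns nil where the file is not available
# (for example on a non-Linux host).
def read_tcp_sysctl(name, root: SYSCTL_ROOT)
  path = File.join(root, name)
  File.readable?(path) ? Integer(File.read(path).strip) : nil
end

# Compares the current values with the expected ones.
def tcp_settings_report(expected: EXPECTED_TCP_SETTINGS, root: SYSCTL_ROOT)
  expected.to_h do |name, want|
    actual = read_tcp_sysctl(name, root: root)
    [name, { expected: want, actual: actual, ok: actual == want }]
  end
end

tcp_settings_report.each do |name, row|
  puts "#{name}: expected=#{row[:expected]} actual=#{row[:actual].inspect} ok=#{row[:ok]}"
end
```

Note that tcp_keepalive_time only affects sockets that have SO_KEEPALIVE switched on, so changing the kernel value alone is not enough if the application never enables keepalive on its sockets.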

We found that our program in the ECS container was successfully sending an API call to the AWS Lambda API. The Lambda API received it and triggered the Lambda. But the socket on the client side (our ECS container) was being closed while we waited for the response, and our app was completely unaware of this.
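At the socket level, the idea is to have the kernel send keepalive probes while the connection sits idle waiting for the response. Here is a minimal sketch using Ruby's standard socket API against a throwaway local listener. The helper names and the 300-second idle value (mirroring tcp_keepalive_time above) are just for illustration, and TCP_KEEPIDLE is only set where the platform defines it.

```ruby
require 'socket'

# Turns on TCP keepalive for one socket and, where supported, sets how long
# the connection may stay idle before the first probe is sent.
def enable_keepalive(socket, idle_seconds: 300)
  socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  if Socket.const_defined?(:TCP_KEEPIDLE)
    socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, idle_seconds)
  end
  socket
end

# Reports whether keepalive is currently enabled on the socket.
def keepalive_enabled?(socket)
  socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE).bool
end

# Demonstration on a local connection; no traffic leaves the machine.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.addr[1])
puts "keepalive before: #{keepalive_enabled?(client)}"
enable_keepalive(client)
puts "keepalive after:  #{keepalive_enabled?(client)}"
client.close
server.close
```

A per-socket TCP_KEEPIDLE overrides the system-wide tcp_keepalive_time for that one socket, so either route should get probes flowing before the idle connection is dropped.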

We also had to monkey patch Net::HTTP, as described in "Increase connect(2) timeout in RestClient / Net::HTTP on AWS Linux".
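The exact patch is in the linked post and is not reproduced here. Purely as an illustration of the general technique, a sketch of a Net::HTTP patch that enables keepalive on every new connection might look like this. The module name `NetHttpKeepAlive` is hypothetical, and the code relies on Net::HTTP internals (the private `connect` method and the `@socket` instance variable), which may differ between Ruby versions.

```ruby
require 'net/http'
require 'socket'

# Hypothetical patch: after Net::HTTP opens its connection, enable
# SO_KEEPALIVE on the underlying TCP socket.
module NetHttpKeepAlive
  private

  def connect
    super
    # @socket is Net::HTTP's buffered wrapper; #io is the TCP or SSL socket,
    # and #to_io unwraps an SSL socket to the plain TCP socket beneath it.
    raw_socket = @socket.io.to_io
    raw_socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  end
end

Net::HTTP.prepend(NetHttpKeepAlive)
```

Since the AWS SDK for Ruby makes its HTTP calls through Net::HTTP, a patch of this kind would apply to the Lambda client's connections as well, but treat that as something to verify in your own environment.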