I am working on a project which uses an SQS-triggered Lambda to ingest some content into AWS Elasticsearch.
The Lambda and the Elasticsearch service are located in the same VPC.
The source code is very simple and its size is 10.5 kB (mostly static resources, i.e. XSL files).
The libraries used are packaged in a separate layer.
When I first deploy the Lambda, everything works correctly: it gets invoked thousands of times and, for about a day or two, behaves as expected. However, it then starts timing out, and once it does, every invocation times out until I do a fresh redeploy.
This occurs whether I use the elasticsearch-py client or requests.get.
Increasing the timeout or memory allocation does not help.
Recycling objects or re-instantiating everything on each invocation does not make any difference, either.
Has anyone experienced similar issues?
This turned out to be a problem with our deployment setup.
My project interacts with an Elasticsearch instance which is created by another team's Terraform deployment.
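When I create the resources for my project environment (also in Terraform), I add an aws_security_group_rule to the existing security group. The other team defines that group's rules inline, so whenever they re-apply, Terraform removes my rule because it is not part of their configuration. A fresh redeploy of my project adds the rule back, which is why redeploying appeared to fix the timeouts.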
The fix for this was to get the other team to define their rule as a separate resource instead of inline.
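For anyone hitting similar symptoms, a quick way to tell a network-level block from a slow Elasticsearch is to try a plain TCP connection from inside the Lambda. Security groups drop disallowed traffic silently, so a missing rule usually shows up as a connect timeout rather than an error response. The following is only a minimal sketch: the function name can_connect and the 3-second default are my own choices for illustration, not something from our actual code.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts (typical when a security group drops the packets),
        # refused connections, and DNS resolution failures.
        return False
```

Calling this at the start of the handler with the Elasticsearch endpoint host and port 443 (assuming the domain is reached over HTTPS) and logging the result would have pointed at the security group much sooner than tuning the Lambda timeout or memory did.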