Lambda accessing Elasticsearch times out after about a day


I am working on a project that uses an SQS-triggered Lambda to ingest content into AWS Elasticsearch.

The Lambda and the Elasticsearch service are located in the same VPC.

The source code is very simple and its size is 10.5 kB (mostly static resources, i.e. XSL files).

The libraries used are packaged in a separate layer.

When I first deploy the Lambda, everything works correctly: it gets invoked thousands of times without problems for a day or two. It then starts timing out, and once it does, every invocation times out until I do a fresh redeploy.

This occurs whether I use the elasticsearch-py client or requests.get.

Increasing the timeout or memory allocation does not help.

Recycling objects or re-instantiating everything on each invocation does not make any difference, either.

Has anyone experienced similar issues?

There is 1 answer

Answer by Leo (best answer)

This turned out to be a problem with our deployment setup.

My project interacts with an Elasticsearch instance that is created by another team's Terraform deployment.

When I create the resources for my project environment (also in Terraform), I add an aws_security_group_rule to the existing security group:

# Look up the existing security group owned by the other team.
data "aws_security_group" "es_sg" {
  name = var.security_group_name
}

# Allow HTTPS from the Lambda's security group to Elasticsearch.
resource "aws_security_group_rule" "allow_lambda_access_to_es" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = data.aws_security_group.es_sg.id
  source_security_group_id = module.lambda.sg_id
  description              = "Ingestion lambda access to ES"
}

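When the other team re-applies their Terraform, the rule is deleted: their security group declares its rules inline, so Terraform treats that inline list as the complete rule set and removes anything else. Without the ingress rule, the Lambda's connections to Elasticsearch on port 443 are dropped and the invocations hang until they time out.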

The fix for this was to get the other team to define their rules as separate resources (for example aws_security_group_rule) instead of inline in the security group.
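
For anyone hitting the same thing, here is a rough sketch of what that change might look like on the other team's side. I don't have their actual code, so the resource names, var.vpc_id, var.internal_cidr and the CIDR-based rule are all assumptions for illustration only.

```hcl
# Hypothetical sketch of the other team's configuration. Resource names,
# variables and the CIDR-based rule are assumptions, not their real code.

# Before (problematic): rules declared inline. Terraform treats the inline
# blocks as the complete rule set, so rules added by other deployments are
# removed on the next apply.
#
# resource "aws_security_group" "es_sg" {
#   name   = var.security_group_name
#   vpc_id = var.vpc_id
#
#   ingress {
#     from_port   = 443
#     to_port     = 443
#     protocol    = "tcp"
#     cidr_blocks = [var.internal_cidr]
#   }
# }

# After: the group itself carries no inline rules...
resource "aws_security_group" "es_sg" {
  name   = var.security_group_name
  vpc_id = var.vpc_id
}

# ...and each rule is its own resource, so rules attached by other
# deployments are left alone.
resource "aws_security_group_rule" "es_internal_https" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [var.internal_cidr]
  security_group_id = aws_security_group.es_sg.id
  description       = "Internal HTTPS access to ES"
}
```

With the rules split out like this, Terraform only manages the rules it has resources for, so the rule from my deployment should survive their applies.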