SQS Messages not deleted after DLQ Implementation


I recently implemented a DLQ (dead-letter queue) for SQS, with the following three configurations:

  1. The SQS queue uses the default visibility timeout.
  2. In the dead-letter queue configuration, the DLQ is enabled and Maximum receives is set to 3.
  3. In the Lambda configuration, "Report batch item failures" is enabled.

The problem is that all messages, whether they succeed or fail, are processed three times and then moved to the DLQ.

For success cases, the correct JSON response is returned.

Once I disable "Report batch item failures", messages are deleted in both the success and failure cases.


There are 2 answers

bgs On BEST ANSWER

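Once "Report batch item failures" is enabled, the return type of the function has to change. Lambda then expects an SQSBatchResponse listing the IDs of the failed messages; as far as I understand the documentation linked below, a response it cannot interpret that way is treated as a failure of the whole batch, which is why even successful messages were retried and ended up in the DLQ.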

Old Code : public async Task<String> FunctionHandlerAsync(SQSEvent sqsEvent)

New Code : public async Task<SQSBatchResponse> FunctionHandlerAsync(SQSEvent sqsEvent)

Because the return type changes, the function body has to change as well.

Create the list of failures:

List<SQSBatchResponse.BatchItemFailure> batchItemFailures = new List<SQSBatchResponse.BatchItemFailure>();

When processing a record throws an exception, add that record's message ID to the list:

batchItemFailures.Add(new SQSBatchResponse.BatchItemFailure { ItemIdentifier = record.MessageId });

Finally, return the batch response from the function:

return new SQSBatchResponse(batchItemFailures);

After these changes, successful messages are deleted correctly and only the failed ones are retried.
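Put together, the pieces above could look like the following minimal sketch. It assumes the Amazon.Lambda.SQSEvents NuGet package; the Function class name and the ProcessMessageAsync helper are hypothetical placeholders for your own code, and the empty-body check only exists to simulate a failure.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.Lambda.SQSEvents;

public class Function
{
    public async Task<SQSBatchResponse> FunctionHandlerAsync(SQSEvent sqsEvent)
    {
        // Collects the IDs of the messages that failed in this batch.
        var batchItemFailures = new List<SQSBatchResponse.BatchItemFailure>();

        foreach (SQSEvent.SQSMessage record in sqsEvent.Records)
        {
            try
            {
                await ProcessMessageAsync(record);
            }
            catch (Exception)
            {
                // Report only this message as failed; the others are still deleted.
                batchItemFailures.Add(new SQSBatchResponse.BatchItemFailure
                {
                    ItemIdentifier = record.MessageId
                });
            }
        }

        // An empty list tells Lambda that the whole batch succeeded.
        return new SQSBatchResponse(batchItemFailures);
    }

    // Hypothetical placeholder for the real business logic.
    private Task ProcessMessageAsync(SQSEvent.SQSMessage record)
    {
        if (string.IsNullOrEmpty(record.Body))
        {
            throw new InvalidOperationException("Empty message body");
        }
        return Task.CompletedTask;
    }
}
```

The try/catch sits inside the loop on purpose: an exception that escapes the handler fails the whole invocation, so every message in the batch would be retried again.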

Reference: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting

Allan Chua On

There are two things I can think of that may help you debug this:

  1. The queue's visibility timeout may be shorter than the time Lambda needs to process the entire batch. In that case, every message in the batch that your function has not yet deleted becomes visible again and is redelivered, until it reaches the maximum receive count and is sent to the dead-letter queue. To avoid this, set the SQS queue's visibility timeout to at least six times your function's timeout.
  2. If the first point does not fix the problem, check whether any message in the batch is failing, and handle the failure so that the message is sent to storage dedicated to inspecting failed messages. That way the rest of the batch is still processed even if some messages cause failures.
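As an illustration of the first point, here is a minimal sketch that derives the visibility timeout from the function timeout and applies it to the queue. It assumes the AWSSDK.SQS NuGet package; the QueueTimeouts class and both method names are hypothetical, and the queue URL and function timeout are values you supply yourself.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class QueueTimeouts
{
    // Six times the function timeout, following the rule of thumb above.
    public static int RecommendedVisibilityTimeoutSeconds(int functionTimeoutSeconds)
        => functionTimeoutSeconds * 6;

    // Applies the derived visibility timeout to the given queue.
    public static Task SetVisibilityTimeoutAsync(
        IAmazonSQS sqs, string queueUrl, int functionTimeoutSeconds)
    {
        var request = new SetQueueAttributesRequest
        {
            QueueUrl = queueUrl,
            Attributes = new Dictionary<string, string>
            {
                // SQS expects the attribute value as a string of seconds.
                ["VisibilityTimeout"] =
                    RecommendedVisibilityTimeoutSeconds(functionTimeoutSeconds).ToString()
            }
        };
        return sqs.SetQueueAttributesAsync(request);
    }
}
```

For a function with a 30-second timeout this sets the queue's visibility timeout to 180 seconds. The same value can also be set in the console or in your infrastructure template instead of in code.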

More tips:

  • It is good practice to make your Lambda function idempotent, so that a message that was already processed successfully has no additional effect if it is delivered again.
  • It also helps to hook up a notification system that monitors the dedicated storage for messages that cause failures or poison the queue.
  • Batch failures also cause Lambda to reduce the number of SQS processors (defaults to 5) when it detects retries from your queue.