AWS FIFO SQS: How does SQS maintain ordering in a group when a message is stuck in DLQ?


I am planning to use AWS FIFO SQS to keep record of current status of each item in my datastore. I will be using the unique identifier of each item as the messageGroupId to ensure strict ordering of messages for each item.

Does SQS ensure that if a message belonging to a particular group is in the DLQ, then no message from that group becomes visible to the consumer until the DLQ message is either deleted or re-driven back to the main queue?

For example, suppose the following three messages are received in order:

(groupIdA, M1)
(groupIdA, M2)
(groupIdA, M3)

My poller successfully consumes M1 but fails to process M2. It retries until the maxReceiveCount is exhausted and SQS moves the message to the DLQ. At that point I still have another message, M3, waiting to be consumed. I want to ensure that M3 only gets processed after M2 has been successfully consumed.

Based on the definition of a FIFO queue, it should be doing something like this to ensure strict ordering. However, I couldn't find any explicit mention of this behavior in the AWS docs. Can anyone help me out?


There are 2 answers

httpdigest On BEST ANSWER

SQS does not appear to block later messages from the same FIFO message group while a failed message sits in the DLQ waiting to be re-driven to the source queue and processed by a consumer.

There is one bullet point in this document that hints at this: https://aws.amazon.com/de/blogs/compute/using-amazon-sqs-dead-letter-queues-to-control-message-failure/

Don’t use a dead-letter queue with a FIFO queue if you don’t want to break the exact order of messages or operations. For example, don’t use a dead-letter queue with instructions in an Edit Decision List (EDL) for a video editing suite, where changing the order of edits changes the context of subsequent edits.

It explicitly says that using a DLQ with a FIFO source queue can break ordering.

So, even with a FIFO queue, once a failed message ends up in the DLQ, consumers will be handed later messages from the same group.
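To make that behavior concrete, here is a toy, in-memory model of a single message group with a redrive policy. It is only a sketch of the behavior described above and does not call SQS; the function name `simulate` and its parameters are made up for illustration, and it assumes a message is moved to the DLQ once it has been received more than `max_receive_count` times.

```python
from collections import deque


def simulate(messages, max_receive_count, fails):
    """Toy model of one FIFO message group with a dead-letter queue.

    messages: message ids in send order.
    fails: set of ids whose processing always fails.
    Returns (processed, dead_lettered), each in the order events happened.
    """
    queue = deque(messages)
    receive_counts = {}
    processed, dead_lettered = [], []

    while queue:
        head = queue[0]  # only the head of the group is ever delivered
        receive_counts[head] = receive_counts.get(head, 0) + 1

        if receive_counts[head] > max_receive_count:
            # Assumption: too many receives moves the message to the DLQ,
            # which unblocks the rest of the group.
            queue.popleft()
            dead_lettered.append(head)
            continue

        if head in fails:
            # Not deleted, so it is delivered again on the next iteration.
            continue

        queue.popleft()  # successful processing deletes the message
        processed.append(head)

    return processed, dead_lettered


processed, dead_lettered = simulate(
    ["M1", "M2", "M3"], max_receive_count=3, fails={"M2"}
)
print(processed)      # ['M1', 'M3']
print(dead_lettered)  # ['M2']
```

In this model M3 is processed even though M2 was never handled successfully, which is exactly the ordering break the quoted bullet point warns about.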

user2501711 On

I am running into the same problem. We currently use a FIFO queue with a FIFO DLQ. Locally, we keep a map of failed messages, keyed by message group ID, and we only process a message if the failure map has no record for its message group. This works fine for the most part. However, if a DLQ is set up with a maxReceiveCount, we will still receive later messages from the group even though the earlier ones were never processed, because those earlier messages have been moved to the DLQ.
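Roughly, the local failure map looks like the sketch below. This is not our production code; the class `GroupGate` and its method names are hypothetical, and `process` stands for whatever handler the consumer runs.

```python
class GroupGate:
    """Holds back a message group once one of its messages has failed."""

    def __init__(self):
        # message group id -> id of the first message that failed
        self.failed = {}

    def handle(self, group_id, message_id, process):
        """Return True if the message was processed, False otherwise.

        The caller should delete the message from the queue only on True.
        """
        if group_id in self.failed:
            return False  # an earlier message in this group failed: skip
        try:
            process(message_id)
        except Exception:
            self.failed[group_id] = message_id
            return False
        return True


def flaky(message_id):
    # Hypothetical handler that cannot process M2.
    if message_id == "M2":
        raise ValueError("cannot process M2")


gate = GroupGate()
results = [gate.handle("groupIdA", m, flaky) for m in ("M1", "M2", "M3")]
print(results)      # [True, False, False]
print(gate.failed)  # {'groupIdA': 'M2'}
```

Note that the map lives in one consumer process, so it is lost on restart and is not shared between consumers. Skipped messages are also not deleted, so each redelivery still counts as a receive.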

I think one solution is to not use a DLQ, so that the failing message in that message group is retried indefinitely. Another is to move the failure map to a distributed cache with a TTL, so that no message in a group is processed within a set period after an earlier one failed, while still taking advantage of the DLQ.
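The TTL idea could look something like this. It is an in-process stand-in only: the class `ExpiringFailureMap` is hypothetical, and a real setup would replace the dictionary with a shared cache that supports key expiry so that all consumers see the same state. The clock is injectable so the expiry logic can be checked without sleeping.

```python
import time


class ExpiringFailureMap:
    """Failure map whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # message group id -> expiry timestamp

    def mark_failed(self, group_id):
        self._expires_at[group_id] = self._clock() + self._ttl

    def is_blocked(self, group_id):
        expiry = self._expires_at.get(group_id)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires_at[group_id]  # entry has expired
            return False
        return True


# Usage with a fake clock: block groupIdA for 60 seconds.
now = [0.0]
failures = ExpiringFailureMap(ttl_seconds=60, clock=lambda: now[0])
failures.mark_failed("groupIdA")
blocked_early = failures.is_blocked("groupIdA")
now[0] = 61.0
blocked_late = failures.is_blocked("groupIdA")
print(blocked_early, blocked_late)  # True False
```

Once the TTL runs out, the group is processed again, so the TTL has to be long enough for someone to fix and re-drive the message in the DLQ.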