DynamoDB Streams with Lambda, how to process the records in order (by logical groups)?

6k views Asked by At

I want to use DynamoDB Streams + AWS Lambda to process chat messages. Messages regarding the same conversation user_idX:user_idY (a room) must be processed in order. Global ordering is not important.

Assuming that I feed DynamoDB in the correct order (room:msg1, room:msg2, etc), how to guarantee that the Stream will feed AWS Lambda sequentially, with guaranteed ordering of the processing of related messages (room) across a single stream?

Example, considering I have 2 shards, how to make sure the logical group goes to the same shard?

I must accomplish this:

Shard 1: 12:12:msg3 12:12:msg2 12:12:msg1 ==> consumer
Shard 2: 13:24:msg2 51:91:msg3 13:24:msg1 51:92:msg2 51:92:msg1 ==> consumer

And not this (messages are respecting the order that I saved in the database, but they are being placed in different shards, thus incorrectly processing different sequences for the same room in parallel):

Shard 1: 13:24:msg2 51:92:msg2 12:12:msg2 51:92:msg2 12:12:msg1 ==> consumer
Shard 2: 51:91:msg3 12:12:msg3 13:24:msg1 51:92:msg1 ==> consumer

This official post mentions this, but I couldn't find anywhere in the docs how to implement it:

The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.

Questions

1) How to set a partition key in DynamoDB Streams?

2) How to create Stream shards that guarantee partition key consistent delivery?

3) Is this really possible after all? Since the official article mentions: a given key will be present in at most one of a set of sibling shards that are active at a given point in time so it seems that msg1 may go to shard 1 and then msg2 to shard 2, as my example above?

EDITED: In this question, I found this:

The amount of shards that your stream has, is based on the amount of partitions the table has. So if you have a DDB table with 4 partitions, then your stream will have 4 shards. Each shard corresponds to a specific partition, so given that all items with the same partition key should be present in the same partition, it also means that those items will be present in the same shard.

Does this mean that I can achieve what I need automatically? "All items with the same partition will be present in the same shard". Does Lambda respect this?

EDIT 2: From the FAQ:

The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.

I don't care about global ordering, just logical one as per example. Still, not clear if the shards group logically with this answer from the FAQ.

1

There are 1 answers

0
Alexander Patrikalakis On BEST ANSWER

In-order processing for updates on the same key will happen automatically. As described in this presentation, one Lambda function per active shard is run. Because all the updates for a particular partition/sort key appear in exactly one shard lineage, they are processed in order.