I have a use case where I receive records from upstream for a particular batchID, along with some metadata about the batch upfront. For example, I am told that batchID="ABC" will have 2000 records. As records arrive in my service, I do some processing and save each one in the DB with status = "PROCESSED". Once I have received all 2000 records for a batchID, I have to create a CSV file containing all the records in that batch, send it to another service, and update their status to "SENT".
Approach 1 (naive): Run a query on a composite GSI on batchID+status and check whether the count matches on every request. This would be very expensive.
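For reference, approach 1 would look something like the sketch below (table and index names are assumptions; the DynamoDB client is passed in so the pagination logic can be exercised without AWS). Note that even with Select='COUNT', DynamoDB consumes read capacity for every item it examines, which is why doing this on every request is expensive:

```python
def count_processed(client, batch_id: str) -> int:
    """Count PROCESSED records for a batch via a hypothetical
    batchID+status GSI. Pass client=boto3.client("dynamodb") in real use.
    """
    total = 0
    kwargs = {
        "TableName": "records_table",          # assumed table name
        "IndexName": "batchID-status-index",   # assumed GSI name
        "KeyConditionExpression": "batchID = :b AND #s = :s",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":b": {"S": batch_id},
            ":s": {"S": "PROCESSED"},
        },
        # COUNT avoids transferring items, but still scans them.
        "Select": "COUNT",
    }
    while True:
        resp = client.query(**kwargs)
        total += resp["Count"]
        # A Query response can be paginated even in COUNT mode.
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```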
Approach 2: Use a DynamoDB atomic counter, where the key is the batchID and the value is a count. On every DB insert, I increment the count, check it, and raise a trigger if it matches the expected total. But in this case there could be throttling and errors (e.g. if the update fails).
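The atomic counter in approach 2 could be sketched as follows (the counter table name and attribute names are assumptions; the client is injected for testability). The ADD action both creates the attribute at zero and increments it in a single atomic operation:

```python
def increment_batch_count(client, batch_id: str, expected: int) -> bool:
    """Atomically increment the per-batch counter; return True when the
    batch is complete. Pass client=boto3.client("dynamodb") in real use.
    """
    resp = client.update_item(
        TableName="batch_counters",  # assumed counter table, hash key batchID
        Key={"batchID": {"S": batch_id}},
        # ADD initializes a missing attribute to 0, then increments atomically.
        UpdateExpression="ADD processedCount :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    count = int(resp["Attributes"]["processedCount"]["N"])
    return count == expected
```

As the question notes, a failed or throttled update here silently loses a count, so the caller needs retries (or the stream-driven design from the answer below, which replays records).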
Had it been SQL, I would have done:
SELECT COUNT(*) FROM records_table WHERE batchID = 'ABC';
I wanted to know if there's a hybrid approach in AWS that I can leverage for this use case.
I'd suggest using a second table for batch indexing and for tracking the number of processed records. You can use a DynamoDB stream to run a Lambda function that increments the count whenever a record reaches the desired status. The same Lambda function then checks whether the count has reached 2000 and, if so, triggers another Lambda function that does the sending. A more detailed description of the architecture follows.
- DynamoDBDataTable — the main table holding the records; its stream feeds Lambda1.
- DynamoDBBatchIndexingTable — tracks, per batchID, how many records have reached "PROCESSED".
- Lambda1 — consumes the stream, updates the count in DynamoDBBatchIndexingTable, and triggers Lambda2 when the count reaches the expected total.
- Lambda2 — builds the CSV, sends it to the downstream service, and updates the records' status to "SENT".
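Lambda1's core logic could be sketched like this. Everything here is an assumption on top of the description above: the table name, the "Lambda2" function name, and in particular that the expected batch size is stored on each stream record as expectedCount. The boto3 clients are injected as parameters so the logic can be tested without AWS (the real handler would pass boto3.client("dynamodb") and boto3.client("lambda")):

```python
import json


def process_stream_event(event, ddb, lam):
    """Handle a DynamoDB Streams event (sketch). For each record whose
    new image has status PROCESSED, atomically bump the batch counter;
    when it reaches the expected total, invoke Lambda2 asynchronously.
    Returns the batchIDs for which Lambda2 was triggered.
    """
    triggered = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage", {})
        if image.get("status", {}).get("S") != "PROCESSED":
            continue  # only count records that just became PROCESSED
        batch_id = image["batchID"]["S"]
        # Assumption: the expected batch size travels with each record.
        expected = int(image["expectedCount"]["N"])
        resp = ddb.update_item(
            TableName="DynamoDBBatchIndexingTable",
            Key={"batchID": {"S": batch_id}},
            UpdateExpression="ADD processedCount :one",
            ExpressionAttributeValues={":one": {"N": "1"}},
            ReturnValues="UPDATED_NEW",
        )
        count = int(resp["Attributes"]["processedCount"]["N"])
        if count == expected:
            # Fire-and-forget: Lambda2 builds the CSV and marks rows SENT.
            lam.invoke(
                FunctionName="Lambda2",
                InvocationType="Event",
                Payload=json.dumps({"batchID": batch_id}).encode(),
            )
            triggered.append(batch_id)
    return triggered
```

One caveat: stream records can be delivered more than once, so the exact "== expected" check can misfire under retries; a conditional update or an idempotency marker per record would harden this.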