I have Flink stream processing application which read stream of messages from Pulsar Topic, process them and store the file in S3. It perform below operation.
- Read messages from Pulsar topic every 30 seconds with TumblingWindow
- KeyBy to divide the stream processing based on key.
- Process it and store it in S3.
- Notify downstream application
Happy path works very well. Problem starts with partial failures and recovery.
Step# 2 can create multiple different streams as there will be different keys in the stream.
Check point 1 Triggered. Stream 1 (Key 1) -- Processing is successful. Stream 2 (Key 2) -- Processing is failed for some reason at step 3 or 4 above. Check point 2 Completed.
If I throw exception, in case stream 2 is failed, It will fail the whole job and reprocess from Checkpoint 1. In this case, Stream 1 will be reprocessed which should not happen.
Is there a way in Flink we can manually avoid acknowledging Pulsar topic for only failed messages or only process failed records after restart. My requirement is to not to perform duplicate processing and reprocess only failed records.
I read savepoint can be one of the solution but did not find any concrete example.
Appreciate your help!!
The short answer in "no". Flink tracks source offsets and sink transactions (plus operator state), in order to support efficient exactly once processing.