Checkpoint failed after time out. We observed there has some subtask didn't respond. Any idea will cause this problem?

Job Context:

Parallelism: 5

Data Volume: Under 40k

BackPressure: send to another API at the end of the job which may take some time there.

missing subtask respond External Call:

  Future<> future = Producer.send(topic, genericRecord, dataSetID);

  return Boolean.TRUE;

1 Answers

David Anderson On

What seems likely here is that future.get() blocks, and for whatever reason, fails to return within the checkpoint timeout interval.

What I can suggest is that you use Flink's RichAsyncFunction instead. This will have the advantage that the subtasks won't be blocked, thereby allowing checkpoints to complete.

RichAsyncFunction will checkpoint the unresolved futures, and re-issue those requests when recovering from failure.