The documentation says that file event notification systems do not guarantee 100% delivery of all files, and it recommends using backfills to guarantee that all files eventually get processed.
But it's not clear how to use it or where it goes in code. Should it be part of spark.readStream or writeStream?
If there is more documentation about it, I would appreciate that.
You can use the cloudFiles.backfillInterval option like this. According to the documentation, it checks for unprocessed files asynchronously and processes them. By setting the interval, you control how frequently the system checks for unprocessed files.
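As a minimal sketch of where the option goes: it belongs on the read side, as an Auto Loader (format "cloudFiles") option passed to spark.readStream, not on writeStream. The paths and the helper function name below are hypothetical placeholders; the option names are the documented Auto Loader options.

```python
# Auto Loader options; cloudFiles.backfillInterval enables periodic
# asynchronous backfills that pick up files missed by notifications.
autoloader_options = {
    "cloudFiles.format": "json",            # source file format (example)
    "cloudFiles.useNotifications": "true",  # file notification mode
    "cloudFiles.backfillInterval": "1 day", # check for unprocessed files daily
}

def start_autoloader_stream(spark, source_path, checkpoint_path, target_path):
    """Hypothetical helper: build and start an Auto Loader stream
    with a backfill interval configured on the readStream side."""
    df = (
        spark.readStream
        .format("cloudFiles")
        .options(**autoloader_options)
        .load(source_path)
    )
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )
```

Note that writeStream only needs the usual checkpointLocation; the backfill behavior is driven entirely by the readStream options.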
Output:
If you look at the checkpoint location:

%fs head dbfs:/checkpointLocation2/offsets/0

Here, the backfill trigger is recorded via lastBackfillStartTimeMs and lastBackfillFinishTimeMs. You can also observe that there are 5 files inside offsets, meaning it checked for old files to process 5 times. This is because I set the interval to 30 seconds; if you set it to 1 day, it will trigger once a day.
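If you want to check the backfill timing programmatically rather than by eyeballing %fs head, a small parser works. This is only a sketch: it assumes the offsets file contains a JSON object carrying the lastBackfillStartTimeMs and lastBackfillFinishTimeMs fields shown above, possibly preceded by a version-marker line; the sample string is fabricated for illustration.

```python
import json

def backfill_duration_ms(offset_file_text):
    """Return how long the last backfill took, in milliseconds,
    or None if no backfill fields are found in the offsets file."""
    for line in offset_file_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines such as a "v1" version header
        data = json.loads(line)
        if "lastBackfillStartTimeMs" in data and "lastBackfillFinishTimeMs" in data:
            return data["lastBackfillFinishTimeMs"] - data["lastBackfillStartTimeMs"]
    return None

# Hypothetical sample mimicking the fields seen in offsets/0:
sample = 'v1\n{"lastBackfillStartTimeMs": 1700000000000, "lastBackfillFinishTimeMs": 1700000002500}'
# backfill_duration_ms(sample) returns 2500
```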