Spark foreachBatch concurrency and time-based triggers


Is the foreachBatch method guaranteed to run the callback "exclusively" (one invocation at a time), or can invocations be started asynchronously? I'm working on a custom sink that relies on consecutive, non-concurrent batches, so this is quite important. For example, in this code:

.writeStream()
    .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
    .foreachBatch((dataframe, batchId) -> {
        // 
        // Can it run the callback next time before a previous one has completed?
        // (i.e. if the previous is working longer than the trigger's time interval)
        // 
    })
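
For context, this is roughly the kind of guard I'm considering putting inside the callback in case exclusivity is not guaranteed by Spark itself. It's only a sketch: GuardedBatchWriter is a hypothetical name and the actual write to my sink is omitted.

import java.util.concurrent.locks.ReentrantLock;

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class GuardedBatchWriter implements VoidFunction2<Dataset<Row>, Long> {
    // foreachBatch callbacks run on the driver, so a single in-process lock
    // is enough to detect (or serialize) overlapping invocations.
    private final ReentrantLock lock = new ReentrantLock();

    @Override
    public void call(Dataset<Row> dataframe, Long batchId) {
        if (!lock.tryLock()) {
            throw new IllegalStateException(
                    "Batch " + batchId + " started before the previous batch finished");
        }
        try {
            // write the micro-batch to the custom sink here
        } finally {
            lock.unlock();
        }
    }
}

It would be passed as .foreachBatch(new GuardedBatchWriter()) instead of the lambda above, but I'd prefer to know whether such a guard is redundant.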

There is a similar API in Spark DStreams, foreachRDD, and its docs explain how it works:

By default, output operations are executed one-at-a-time.

But I couldn't find anything like that in the Structured Streaming docs. Did I miss something?

(Of course, I could just run it and check, but it would be good to have a clear answer confirming that this is not an internal implementation detail that could suddenly change in a minor version upgrade.)
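
For what it's worth, this is the sort of self-contained check I would run, using the built-in rate source and a callback that deliberately sleeps for longer than the trigger interval. The class name and the chosen intervals are arbitrary, and even if batches never overlap in this test, it wouldn't tell me whether that behaviour is contractual.

import java.time.LocalTime;
import java.util.concurrent.TimeUnit;

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class ForeachBatchOverlapCheck {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("foreachBatch-overlap-check")
                .getOrCreate();

        // The built-in "rate" source generates rows by itself, so no external data is needed.
        Dataset<Row> source = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "1")
                .load();

        // Log entry/exit times and make each batch take ~30s against a 10s trigger,
        // so any overlapping invocation would show up immediately in the output.
        VoidFunction2<Dataset<Row>, Long> callback = (dataframe, batchId) -> {
            System.out.println(LocalTime.now() + " batch " + batchId + " started");
            Thread.sleep(30_000);
            System.out.println(LocalTime.now() + " batch " + batchId
                    + " finished (" + dataframe.count() + " rows)");
        };

        source.writeStream()
                .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
                .foreachBatch(callback)
                .start()
                .awaitTermination();
    }
}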
