If we are using Beam streaming with, for example, a Kafka source, and we use the default processing-time timestamps, is it possible to have late data?
For example, a pipeline like this:
pipeline | "ReadFromKafka" >> ReadFromKafka()
         | "WINDOW" >> beam.WindowInto(FixedWindows(5))
         | "DESERIALIZE REQUESTS" >> beam.Map(deserialize).with_output_types(Something)
         | "YIELD MULTIPLE ITEMS" >> beam.ParDo(RandomLatency())
         | "GROUP" >> beam.GroupByKey()
         | "BIG PROCESSING" >> beam.ParDo(BigProcessing())
I understand that windows are not used until the GroupByKey, but for a given request from Kafka I have a ParDo that yields multiple items with the same key (I even split the pipeline and join with CoGroupByKey).
My understanding is that, because I am using processing-time timestamps, my pipeline can't have late data in the GroupByKey and CoGroupByKey. Is that correct?
Is that because the watermark will always be behind the timestamps of the data entering the pipeline, even if my GroupByKey happens at a late stage of the pipeline?
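To make my intuition concrete, here is a minimal plain-Python sketch (not Beam code; `is_late` and the watermark update are simplified stand-ins I made up for illustration): if every element is stamped with the local clock the moment it arrives, its timestamp can never be older than a watermark that only tracks previously seen timestamps, so no element should ever be classified as late.

```python
import time

def is_late(element_ts, watermark):
    # Simplified late-data check: an element is late if its
    # timestamp falls behind the current watermark.
    return element_ts < watermark

watermark = float("-inf")
late_seen = False
for _ in range(5):
    ts = time.monotonic()           # timestamp assigned on arrival (processing time)
    late_seen = late_seen or is_late(ts, watermark)
    watermark = max(watermark, ts)  # watermark trails the arrivals it has seen
print(late_seen)  # False
```

This is what I mean by "the watermark is always behind the data": with event-time timestamps an out-of-order Kafka record could carry a timestamp older than the watermark, but with processing-time stamping that ordering inversion cannot happen by construction.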