If we are using Beam streaming with, for example, a Kafka source, and we use the default processing-time timestamps, is it possible to have late data?
For example, a pipeline like this:
pipeline | "ReadFromKafka" >> ReadFromKafka()
         | "WINDOW" >> beam.WindowInto(FixedWindows(5))
         | "DESERIALIZE REQUESTS" >> beam.Map(deserialize).with_output_types(Something)
         | "YIELD MULTIPLE ITEMS" >> beam.ParDo(RandomLatency())
         | "GROUP" >> beam.GroupByKey()
         | "BIG PROCESSING" >> beam.ParDo(BigProcessing())
I understand that windows are not used until the GroupByKey, but for a given request from Kafka I have a ParDo that yields multiple items with the same key (I even split the pipeline and join with CoGroupByKey).
My understanding is that, because I am using processing-time timestamps, my pipeline can't have late data in the GroupByKey and CoGroupByKey. Is that correct?
Is that because the watermark will always be behind the timestamps of the data entering the pipeline, even if my GroupByKey happens at a late stage of the pipeline?
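To make my intuition concrete, here is a minimal plain-Python sketch (not Beam code; `is_late` and the watermark update are simplified stand-ins I made up for illustration): if every element is stamped with the local clock the moment it arrives, its timestamp can never be older than a watermark that only tracks previously seen timestamps, so no element should ever be classified as late.

```python
import time

def is_late(element_ts, watermark):
    # Simplified late-data check: an element is late if its
    # timestamp falls behind the current watermark.
    return element_ts < watermark

watermark = float("-inf")
late_seen = False
for _ in range(5):
    ts = time.monotonic()           # timestamp assigned on arrival (processing time)
    late_seen = late_seen or is_late(ts, watermark)
    watermark = max(watermark, ts)  # watermark trails the arrivals it has seen
print(late_seen)  # False
```

This is what I mean by "the watermark is always behind the data": with event-time timestamps an out-of-order Kafka record could carry a timestamp older than the watermark, but with processing-time stamping that ordering inversion cannot happen by construction.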