Apache Beam SqlTransform does not process data distributed. It doesn't use multiple workers. How to deal with Dataflow pipeline "straggler detected"

49 views Asked by At

Dataflow pipeline straggler

I run a pipeline with a SqlTransform component. The SqlTransform compute some windowing aggregates like rolling average. The pipeline is set to use up to three workers but in the Execution details tab, I can see that for this component is reported "1 straggler detected" and the processing very slow. I think it is using only one worker.

[Update]: I identified that the problem is generated by this computation:

NUM_1 /  SUM(NUM_1) OVER() AS COST_2_DATASET_AVERAGE,

My questions are: 1) why it is not using more than one worker for this step and 2) how to deal with this issue?

The windowing query I use is:

windowing_query = """SELECT DATE_STR, SUBS_ID, (NUM_1 + NUM_2 + NUM_3) AS TOTAL_COST,
                    NUM_1 /  SUM(NUM_1) OVER() AS COST_2_DATASET_AVERAGE,
                    AVG(NUM_1) OVER (w ROWS 2 PRECEDING) as AVERAGE_COST_LAST_3MTHS
                    FROM PCOLLECTION 
                    WINDOW w AS (PARTITION BY SUBS_ID ORDER BY DATE_STR)"""

feature_rows = rows_train_dataset | SqlTransform(windowing_query)

enter image description here

1

There are 1 answers

1
Kenn Knowles On

I think the problem may be that the OVER() clause does not include a PARTITION BY since you did not reference w in it.