Dataflow pipeline straggler
I run a pipeline with a SqlTransform component. The SqlTransform compute some windowing aggregates like rolling average. The pipeline is set to use up to three workers but in the Execution details tab, I can see that for this component is reported "1 straggler detected" and the processing very slow. I think it is using only one worker.
[Update]: I identified that the problem is generated by this computation:
NUM_1 / SUM(NUM_1) OVER() AS COST_2_DATASET_AVERAGE,
My questions are: 1) why it is not using more than one worker for this step and 2) how to deal with this issue?
The windowing query I use is:
windowing_query = """SELECT DATE_STR, SUBS_ID, (NUM_1 + NUM_2 + NUM_3) AS TOTAL_COST,
NUM_1 / SUM(NUM_1) OVER() AS COST_2_DATASET_AVERAGE,
AVG(NUM_1) OVER (w ROWS 2 PRECEDING) as AVERAGE_COST_LAST_3MTHS
FROM PCOLLECTION
WINDOW w AS (PARTITION BY SUBS_ID ORDER BY DATE_STR)"""
feature_rows = rows_train_dataset | SqlTransform(windowing_query)

I think the problem may be that the
OVER()clause does not include aPARTITION BYsince you did not referencewin it.