Synapse Pipeline: DF-Executor-OutOfMemoryError


My source is nested JSON in gzip format. In the Synapse pipeline I am using a Data Flow activity, and the source dataset has the compression type set to gzip. The pipeline executed fine for small files under 10 MB, but it fails when I run it against a larger gzip file of about 89 MB.

The Data Flow activity failed with the error below:

Error1 {"message":"Job failed due to reason: Cluster ran into out of memory issue during execution,
 please retry using an integration runtime with bigger core count and/or memory optimized compute type.
 Details:null","failureType":"UserError","target":"df_flatten_inf_provider_references_gz","errorCode":"DF-Executor-OutOfMemoryError"}

Requesting your help and guidance.

To resolve Error1, I tried an Azure integration runtime with a bigger core count (128 + 16 cores) and the memory-optimized compute type, but I got the same error. I thought it might be too intensive to read JSON directly from gzip, so I tried a basic Copy Data activity to decompress the gzip file first, but it still fails with the same error.

1 Answer

Pratik Lad

For your scenario, I would recommend not pulling all of the data from one big JSON file, but pulling it from smaller JSON files instead. First, partition your big JSON file into several parts with a data flow using the Round Robin partition technique, and store these files in a folder in blob storage.

With round robin, data is distributed evenly across partitions. Use round robin when you don't have good key candidates and just need a reasonable default partitioning scheme. The number of physical partitions is configurable.
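
If it helps to see the idea outside the data flow UI, here is a minimal sketch of the same round-robin style split done from a Synapse Spark notebook instead of a data flow; the storage paths and the partition count of 100 are placeholders, not values from the original question.

```python
# Illustrative sketch only: splitting a large gzipped JSON source into many
# smaller files from a Synapse Spark notebook. Paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Gzip is not splittable, so the whole file is read by a single task first.
df = spark.read.option("multiline", "true").json(
    "abfss://source@yourstorageaccount.dfs.core.windows.net/in_network/large_file.json.gz"
)

# repartition(n) without a key spreads rows evenly (round-robin style)
# across 100 partitions.
df = df.repartition(100)

# Each partition is written out as its own, much smaller JSON file.
df.write.mode("overwrite").json(
    "abfss://staging@yourstorageaccount.dfs.core.windows.net/in_network/partitioned/"
)
```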

You need to evaluate the data size or the partition count of the input data, then set a reasonable partition count under the "Optimize" tab. For example, if the cluster used for the data flow execution has 8 cores with 20 GB of memory per core, but the input data is 1000 GB split into 10 partitions, running the data flow directly will hit the OOM issue because 1000 GB / 10 > 20 GB. It is better to set the partition count to 100 (1000 GB / 100 < 20 GB).
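
A quick back-of-the-envelope check of that arithmetic (the numbers are just the ones from the example above):

```python
# Sketch of the partition-count estimate from the example figures.
data_size_gb = 1000        # total input size
mem_per_core_gb = 20       # memory available per core on the chosen IR
current_partitions = 10

per_partition_gb = data_size_gb / current_partitions  # 100 GB > 20 GB -> OOM risk
min_partitions = -(-data_size_gb // mem_per_core_gb)  # ceil(1000 / 20) = 50

print(per_partition_gb, min_partitions)
# Using 100 partitions keeps each partition at ~10 GB, comfortably under 20 GB.
```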

After the above process, use these partitioned files to perform the data flow operations inside a ForEach activity, and finally merge the results into a single file.
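
As a rough illustration of that final merge step, assuming the per-partition outputs end up as newline-delimited JSON files you can read from one folder (the file paths here are hypothetical):

```python
# Minimal local sketch of merging per-partition JSON outputs into one file.
import glob

part_files = sorted(glob.glob("partitioned_output/part-*.json"))

with open("merged_output.json", "w", encoding="utf-8") as merged:
    for path in part_files:
        with open(path, "r", encoding="utf-8") as part:
            for line in part:       # each line is one JSON record
                merged.write(line)
```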

Reference: Partition in dataflow.