I have a nested JSON source in gzip format. In my Synapse pipeline I am using a Data Flow activity, with the compression type set to gzip in the source dataset. The pipeline executed fine for small files under 10 MB, but when I ran it against a larger gzip file of about 89 MB, the Data Flow activity failed with the error below:
Error1 {"message":"Job failed due to reason: Cluster ran into out of memory issue during execution,
please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Details:null","failureType":"UserError","target":"df_flatten_inf_provider_references_gz","errorCode":"DF-Executor-OutOfMemoryError"}
Requesting your help and guidance.
To resolve Error1, I tried an Azure integration runtime with a bigger core count (128 + 16 cores) and the memory optimized compute type, but I still got the same error. I thought it might be too intensive to read JSON directly from gzip, so I tried a basic Copy data activity to decompress the gzip file first, but it still fails with the same error.
For your scenario, I would recommend that instead of pulling all the data from one large JSON file, you pull from several smaller JSON files. First, split your big JSON file into parts with a data flow using the round robin partitioning technique, and store those files in a folder in blob storage.
Round robin distributes data evenly across partitions. Use it when you don't have good key candidates and want a simple, sensible partitioning scheme. The number of physical partitions is configurable.
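Since Synapse data flows execute on Spark under the hood, the same round robin split can be sketched in PySpark for clarity. This is a minimal sketch, not your exact pipeline: the storage paths and partition count below are placeholder assumptions, and `repartition()` with only a count (no key column) is what performs the round robin redistribution:

```python
from pyspark.sql import SparkSession

# Hypothetical paths; replace with your own storage account and container.
SOURCE = "abfss://data@mystorageaccount.dfs.core.windows.net/in/providers.json.gz"
OUTPUT = "abfss://data@mystorageaccount.dfs.core.windows.net/partitioned/"

spark = SparkSession.builder.appName("round-robin-split").getOrCreate()

# Spark reads gzipped JSON transparently, but a .gz file is not splittable,
# so the initial read lands in a single partition on a single executor.
df = spark.read.option("multiLine", "true").json(SOURCE)

# repartition() with only a partition count (no key column) performs a
# round robin redistribution of rows across the requested partitions.
df.repartition(20).write.mode("overwrite").json(OUTPUT)
```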
After the split, use these partitioned files to perform the data flow operations inside a ForEach activity, and finally merge the outputs back into a single file.
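The final merge step can be sketched the same way; here `coalesce(1)` funnels all rows into a single output file (the folder names are again placeholders, and this only makes sense if the merged result is small enough to be written by one task):

```python
from pyspark.sql import SparkSession

# Hypothetical folders; PROCESSED holds the per-partition outputs.
PROCESSED = "abfss://data@mystorageaccount.dfs.core.windows.net/processed/"
MERGED = "abfss://data@mystorageaccount.dfs.core.windows.net/merged/"

spark = SparkSession.builder.appName("merge-partitions").getOrCreate()

# Read all the partitioned output files back as one DataFrame.
df = spark.read.json(PROCESSED)

# coalesce(1) collapses everything into a single output partition,
# producing one merged file in the target folder.
df.coalesce(1).write.mode("overwrite").json(MERGED)
```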
Reference: Partitioning in data flow.