About 300 TB of data is being transferred to BigQuery using Google Cloud Data Fusion (Developer edition).
A recent run took 34 minutes to process approximately 16 GB, which extrapolates to roughly 10 days for 6 TB of data.
What settings can be modified in Data Fusion to speed up the ETL operations in this data pipeline?
Thank you for reading.
What you can do is change the compute profile settings, which specify how and where a pipeline is executed. For example, a profile includes the type of cloud provider, the service to use on the cloud provider (such as Dataproc), resources (memory and CPU), image, minimum and maximum node count, and other values.
Learn more about profiles on the CDAP documentation site.
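As a sketch, a custom profile can also be created programmatically through the CDAP REST API that Data Fusion exposes. The endpoint placeholder, the `gcp-dataproc` provisioner name, and the property names (`workerCPUs`, `workerMemoryMB`, `workerNumNodes`) below are assumptions based on the CDAP Dataproc provisioner and may differ by version:

```python
import google.auth
import requests
from google.auth.transport.requests import Request

# Hypothetical placeholders: look up your instance's real CDAP endpoint with
# `gcloud beta data-fusion instances describe <instance>` (apiEndpoint field).
CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PROFILE_NAME = "high-mem-dataproc"

# Authenticate with Application Default Credentials.
creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(Request())

# Assumed request body for the CDAP Dataproc provisioner; the property
# names and values here are illustrative, not prescriptive.
profile_body = {
    "description": "Dataproc profile with larger workers for heavy ETL",
    "provisioner": {
        "name": "gcp-dataproc",
        "properties": [
            {"name": "workerCPUs", "value": "8", "isEditable": True},
            {"name": "workerMemoryMB", "value": "32768", "isEditable": True},
            {"name": "workerNumNodes", "value": "10", "isEditable": True},
        ],
    },
}

# CDAP exposes profiles under /v3/namespaces/<ns>/profiles/<name>.
resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/profiles/{PROFILE_NAME}",
    headers={"Authorization": f"Bearer {creds.token}"},
    json=profile_body,
)
resp.raise_for_status()
print("Created/updated profile:", PROFILE_NAME)
```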
One option is to create a new compute profile with a higher limit on worker memory, or to override worker memory for a single run of the pipeline (see the sketch after these steps):

1. Click System Admin in the top right, then click the Configuration tab and create the new compute profile there.
2. Once the new compute profile is created, attach it to the pipeline by clicking Configure in the pipeline detail view, choosing the newly created compute profile, and clicking Save.
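If you only want more resources for a particular run rather than permanently, CDAP also accepts runtime arguments when a pipeline is started. A minimal sketch, assuming a batch pipeline (whose underlying workflow is named DataPipelineWorkflow) and that the `system.profile.name` / `system.profile.properties.*` argument names apply to your CDAP version:

```python
import google.auth
import requests
from google.auth.transport.requests import Request

# Same hypothetical endpoint placeholder as above.
CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PIPELINE = "my_bq_load_pipeline"  # hypothetical pipeline name

creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(Request())

# Runtime arguments for this run only: select the custom profile and
# bump worker memory. Argument names are assumptions from the CDAP docs.
runtime_args = {
    "system.profile.name": "USER:high-mem-dataproc",
    "system.profile.properties.workerMemoryMB": "32768",
}

# Batch pipelines are started via their DataPipelineWorkflow program.
resp = requests.post(
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
    "/workflows/DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {creds.token}"},
    json=runtime_args,
)
resp.raise_for_status()
print("Run started with overrides:", runtime_args)
```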
Additionally, please check the autoscaling option in Data Fusion.
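Autoscaling lets the ephemeral Dataproc cluster grow with the workload instead of staying at a fixed size. One way to prepare for this, assuming your Data Fusion version's Dataproc profile can reference a Dataproc autoscaling policy (newer versions also expose a predefined-autoscaling toggle), is to create a policy with the google-cloud-dataproc client; all values here are illustrative:

```python
import datetime

from google.cloud import dataproc_v1

PROJECT_ID = "my-project"  # hypothetical project
REGION = "us-central1"     # hypothetical region

# The autoscaling policy API is regional, so point the client at the region.
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"})

policy = dataproc_v1.AutoscalingPolicy(
    id="etl-autoscale",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            graceful_decommission_timeout=datetime.timedelta(minutes=10),
            scale_up_factor=0.5,    # claim half of pending YARN memory per cycle
            scale_down_factor=0.5,  # release half of idle YARN memory per cycle
        )
    ),
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2,
        max_instances=50,
    ),
)

created = client.create_autoscaling_policy(
    parent=f"projects/{PROJECT_ID}/regions/{REGION}", policy=policy)
print("Created policy:", created.name)
```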