How can I speed up a GCP Data Fusion data pipeline?


About 300 TB of data is being transferred to BigQuery using Google Cloud Data Fusion (Developer edition).

It currently takes about 34 minutes to process approximately 16 GB, so processing 6 TB would take about 10 days.

What settings can be modified in Data Fusion to speed up ETL operations in the data pipeline?

Thank you for reading.


1 Answer

aga (best answer)

What you can do is change the compute profile settings, which specify how and where a pipeline is executed. For example, a profile includes the type of cloud provider, the service to use on that provider (such as Dataproc), resources (memory and CPU), the image, the minimum and maximum node count, and other values.

Learn more about profiles on the CDAP documentation site.

One option is to create a new compute profile with a higher limit on worker memory, or to override worker memory for a single run of the pipeline (scripted equivalents of both are sketched below):

  1. Click System Admin in the top right, then click the Configuration tab
  2. Click System Compute Profiles
  3. Click Create New Profile
  4. Choose Cloud Dataproc
  5. Leave the Project ID and Service account key blank
  6. Enter the required worker node configuration
  7. Click Save
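If you prefer to script this instead of clicking through the UI, the same profile can be created through the CDAP REST API that Data Fusion exposes. The sketch below is a minimal example, not a definitive implementation: the instance endpoint, token, profile name, and the Dataproc provisioner property names (`workerNumNodes`, `workerCPUs`, `workerMemoryMB`) are assumptions you should verify against your CDAP/Data Fusion version.

```python
import requests

# Hypothetical values: replace with your instance's CDAP API endpoint and an
# OAuth access token (e.g. from `gcloud auth print-access-token`).
CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"
TOKEN = "<access-token>"

profile = {
    "label": "bigger-workers",
    "description": "Dataproc profile with larger workers for heavy ETL runs",
    "provisioner": {
        "name": "gcp-dataproc",  # the Dataproc provisioner used by Data Fusion
        "properties": [
            # Property names are assumptions; check the Dataproc provisioner
            # settings shown in your CDAP version's profile UI.
            {"name": "workerNumNodes", "value": "10"},
            {"name": "workerCPUs", "value": "8"},
            {"name": "workerMemoryMB", "value": "32768"},
        ],
    },
}

# PUT /v3/profiles/<name> creates a system-scoped compute profile.
resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/profiles/bigger-workers",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=profile,
)
resp.raise_for_status()
```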

Once the new compute profile is created, attach it to the pipeline by clicking Configure in the pipeline detail view, choosing the newly created compute profile, and clicking Save.
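For a one-off run you don't have to create a profile at all; runtime arguments can select a profile and override its worker resources for that run only. Here is a minimal sketch, assuming the `system.profile.name` / `system.profile.properties.*` runtime-argument convention and the `DataPipelineWorkflow` program name that CDAP batch pipelines run as; the endpoint, token, and pipeline name `my_pipeline` are hypothetical:

```python
import requests

CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"  # hypothetical
TOKEN = "<access-token>"  # hypothetical OAuth token

runtime_args = {
    # "SYSTEM:" selects a system-scoped profile, "USER:" a namespace one.
    "system.profile.name": "SYSTEM:bigger-workers",
    # Per-run overrides of the profile's worker settings (property names are
    # assumptions; verify against your CDAP version).
    "system.profile.properties.workerMemoryMB": "32768",
    "system.profile.properties.workerNumNodes": "20",
}

# Batch pipelines run as the DataPipelineWorkflow program of the app;
# runtime arguments go in the request body of the start call.
resp = requests.post(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my_pipeline/workflows/"
    "DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=runtime_args,
)
resp.raise_for_status()
```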

Additionally, check the autoscaling option in Data Fusion.
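Newer Data Fusion versions expose Dataproc's predefined autoscaling as a provisioner property; whether it is available depends on your version. A hedged sketch of turning it on for a single run via a runtime argument (the `enablePredefinedAutoScaling` property name is an assumption to verify against your version's documentation):

```python
# Hypothetical: enable predefined autoscaling for one run by adding this
# runtime argument to the start call shown above; verify the property name
# and its availability in your Data Fusion / CDAP version first.
runtime_args = {
    "system.profile.properties.enablePredefinedAutoScaling": "true",
}
```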