How can I speed up a GCP Data Fusion data pipeline?


About 300 TB of data is being transferred to BigQuery using Google Cloud Data Fusion (Developer edition).

It currently takes about 34 minutes to process approximately 16 GB, so processing 6 TB would take about 10 days.

What settings can be modified in Data Fusion to speed up ETL operations in the data pipeline?

Thank you for reading.


1 Answer

aga (best answer)

What you can do is change the compute profile settings, which specify how and where a pipeline is executed. For example, a profile includes the type of cloud provider, the service to use on that provider (such as Dataproc), resources (memory and CPU), the image, the minimum and maximum node count, and other values.

Learn more about profiles on the CDAP documentation site.

One option is to create a new compute profile with a higher limit on worker memory, or to override worker memory for a single run of the pipeline (scripted equivalents of both are sketched below):

  1. Click System Admin in the top right, then click the Configuration tab
  2. Click System Compute Profiles
  3. Click Create New Profile
  4. Choose Cloud Dataproc
  5. Leave the Project ID and Service account key blank
  6. Enter the required worker node configuration
  7. Click Save
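If you prefer to script this instead of clicking through the UI, the same profile can be created through the CDAP REST API that Data Fusion exposes. The sketch below is a minimal example, not a definitive implementation: the instance endpoint, token, profile name, and the Dataproc provisioner property names (`workerNumNodes`, `workerCPUs`, `workerMemoryMB`) are assumptions you should verify against your CDAP/Data Fusion version.

```python
import requests

# Hypothetical values: replace with your instance's CDAP API endpoint and an
# OAuth access token (e.g. from `gcloud auth print-access-token`).
CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"
TOKEN = "<access-token>"

profile = {
    "label": "bigger-workers",
    "description": "Dataproc profile with larger workers for heavy ETL runs",
    "provisioner": {
        "name": "gcp-dataproc",  # the Dataproc provisioner used by Data Fusion
        "properties": [
            # Property names are assumptions; check the Dataproc provisioner
            # settings shown in your CDAP version's profile UI.
            {"name": "workerNumNodes", "value": "10"},
            {"name": "workerCPUs", "value": "8"},
            {"name": "workerMemoryMB", "value": "32768"},
        ],
    },
}

# PUT /v3/profiles/<name> creates a system-scoped compute profile.
resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/profiles/bigger-workers",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=profile,
)
resp.raise_for_status()
```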

Once the new compute profile is created, attach it to the pipeline by clicking Configure in the pipeline detail view, choosing the newly created compute profile, and clicking Save.
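For a one-off run you don't have to create a profile at all; runtime arguments can select a profile and override its worker resources for that run only. Here is a minimal sketch, assuming the `system.profile.name` / `system.profile.properties.*` runtime-argument convention and the `DataPipelineWorkflow` program name that CDAP batch pipelines run as; the endpoint, token, and pipeline name `my_pipeline` are hypothetical:

```python
import requests

CDAP_ENDPOINT = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api"  # hypothetical
TOKEN = "<access-token>"  # hypothetical OAuth token

runtime_args = {
    # "SYSTEM:" selects a system-scoped profile, "USER:" a namespace one.
    "system.profile.name": "SYSTEM:bigger-workers",
    # Per-run overrides of the profile's worker settings (property names are
    # assumptions; verify against your CDAP version).
    "system.profile.properties.workerMemoryMB": "32768",
    "system.profile.properties.workerNumNodes": "20",
}

# Batch pipelines run as the DataPipelineWorkflow program of the app;
# runtime arguments go in the request body of the start call.
resp = requests.post(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my_pipeline/workflows/"
    "DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=runtime_args,
)
resp.raise_for_status()
```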

Additionally, check the autoscaling option in Data Fusion.
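Newer Data Fusion versions expose Dataproc's predefined autoscaling as a provisioner property; whether it is available depends on your version. A hedged sketch of turning it on for a single run via a runtime argument (the `enablePredefinedAutoScaling` property name is an assumption to verify against your version's documentation):

```python
# Hypothetical: enable predefined autoscaling for one run by adding this
# runtime argument to the start call shown above; verify the property name
# and its availability in your Data Fusion / CDAP version first.
runtime_args = {
    "system.profile.properties.enablePredefinedAutoScaling": "true",
}
```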