I have a number of Dataflow templates to copy data from BigQuery to Bigtable tables.
The largest copies about 9 million rows, roughly 22 GB of data.
There are no complex mutations, it's just a copy.
I've noticed that while the Dataflow templates run, the Bigtable instance's CPU spikes to 100% and read/write latency slows considerably. This happens even with only 1 worker and no customization of the threads.
I've tried tinkering with the number of workers and constraining numberOfWorkerHarnessThreads, but I haven't found a combination that still loads the data in a reasonable amount of time without spiking the Bigtable instance.
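For reference, this is the kind of tuning I've been experimenting with when launching the template (the main class name and the specific worker/thread values below are illustrative, not recommendations):

```shell
# Illustrative launch: cap workers and harness threads (values are examples)
mvn compile exec:java -Dexec.mainClass=com.example.BigQueryToBigtable \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --maxNumWorkers=1 \
    --numberOfWorkerHarnessThreads=4"
```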
Pipeline
BigQueryBigtableTransferOptions options =
    PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(BigQueryBigtableTransferOptions.class);

CloudBigtableTableConfiguration config =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();

Pipeline p = Pipeline.create(options);
p.apply(BigQueryIO.readTableRows()
        .withoutValidation()
        .fromQuery(options.getBqQuery())
        .usingStandardSql())
    .apply(ParDo.of(new Transform(options.getBigtableRowKey())))
    .apply(CloudBigtableIO.writeToTable(config));
p.run();
The BigQuery query is just a SELECT * query, and the Transform operation just adds a Bigtable column for every column from BQ; there is no additional logic.
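To make the Transform concrete, the core of it is a per-column copy. Here is a simplified stand-in for that mapping logic, using plain JDK types instead of Beam's TableRow and an HBase Put (the column family name `cf` is an assumption; the real row key handling also differs, since the key names the Put rather than becoming a cell):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowMapper {
    // Hypothetical column family; the real pipeline's family name may differ.
    static final String FAMILY = "cf";

    // Copies every BigQuery column into a "family:qualifier" -> value map,
    // one cell per source column. Stand-in for the real DoFn, which would
    // build an HBase Put keyed by the configured row-key column instead.
    static Map<String, String> toCells(Map<String, Object> bqRow) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, Object> col : bqRow.entrySet()) {
            cells.put(FAMILY + ":" + col.getKey(), String.valueOf(col.getValue()));
        }
        return cells;
    }
}
```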
What is the Bigtable cluster's node-scaling configuration? What are the minimum and current numbers of nodes?
Bigtable scales to handle workloads by increasing the number of nodes. With a sudden batch workload, it can take about 20 minutes under load before added nodes produce a significant improvement in cluster performance.
If the cluster is set to autoscale, set the minimum number of nodes so that the cluster does not scale down too far.
If it is set to manual scaling, add nodes at least 20 minutes before the workload increases.
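Either adjustment can be scripted before the Dataflow job starts; a sketch with gcloud (instance/cluster names and node counts are placeholders):

```shell
# Autoscaling: raise the floor so the cluster can't scale down too far
gcloud bigtable clusters update my-cluster \
  --instance=my-instance \
  --autoscaling-min-nodes=6 \
  --autoscaling-max-nodes=10

# Manual scaling: add nodes at least ~20 minutes before the batch load
gcloud bigtable clusters update my-cluster \
  --instance=my-instance \
  --num-nodes=8
```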