Scanning an entire Bigtable table (a specific column family) via Dataflow


We use 50-100 Bigtable nodes; the count varies throughout the day depending on the amount of data we process.

Every day we run a Dataflow job that scans the entire Bigtable table (one specific column family) and dumps the scanned data to GCS.

We have been experimenting with how the number of worker nodes affects the speed of scanning the entire table. Typically the job scans ~100M rows (the table holds many more, but we restrict the range to a 24-hour window), and the scanned data is roughly 1 TiB.
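For reference, here is a minimal sketch of what such a scan job can look like with the Beam Java SDK's BigtableIO connector. The project/instance/table IDs, the column family name, the key bounds standing in for the 24-hour window, and the GCS path are all placeholders, and the row formatting is deliberately simplified:

    import com.google.bigtable.v2.Row;
    import com.google.bigtable.v2.RowFilter;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.io.range.ByteKey;
    import org.apache.beam.sdk.io.range.ByteKeyRange;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class DailyBigtableScan {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Server-side filter so only the one column family of interest is returned.
        RowFilter familyFilter =
            RowFilter.newBuilder().setFamilyNameRegexFilter("my-family").build();

        // Hypothetical key bounds standing in for the 24-hour window of row keys.
        ByteKeyRange window = ByteKeyRange.of(
            ByteKey.copyFrom("row-start".getBytes(StandardCharsets.UTF_8)),
            ByteKey.copyFrom("row-end".getBytes(StandardCharsets.UTF_8)));

        p.apply("ReadFromBigtable",
                BigtableIO.read()
                    .withProjectId("my-project")    // placeholder
                    .withInstanceId("my-instance")  // placeholder
                    .withTableId("my-table")        // placeholder
                    .withRowFilter(familyFilter)
                    .withKeyRange(window))
            // Simplified formatting: a real job would serialize the cells, not just the key.
            .apply("FormatRows",
                MapElements.into(TypeDescriptors.strings())
                    .via((Row row) -> row.getKey().toStringUtf8()))
            .apply("WriteToGcs", TextIO.write().to("gs://my-bucket/bigtable-dump/part"));

        p.run();
      }
    }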

With the number of Bigtable nodes fixed (say, at 80), we varied the number of Dataflow workers (n1-standard-1) incrementally from 15 to 50, and the scan speed did not scale linearly. Similarly, when we kept the number of Dataflow workers fixed at 50 and varied the number of Bigtable nodes between 40 and 80, the read throughput did not change much (as long as there were "enough" Bigtable nodes). If this is the case, what other options do we have for a faster scan job? One idea is to run several scanning jobs, each scanning a contiguous subset of rows (sketched below), but we would prefer to avoid this approach.
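For illustration of that last idea only, here is a sketch of how each of several jobs could be restricted to one contiguous key slice via BigtableIO's withKeyRange. The split points and IDs are hypothetical; real split points would follow the table's row-key design:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.io.range.ByteKey;
    import org.apache.beam.sdk.io.range.ByteKeyRange;

    public class KeySliceReads {
      // Hypothetical split points covering the key space in contiguous slices.
      private static final String[] SPLIT_POINTS = {"a", "g", "n", "t", "z"};

      // Builds one BigtableIO read per slice; each slice would be handed to a separate scan job.
      public static List<BigtableIO.Read> buildSliceReads() {
        List<BigtableIO.Read> reads = new ArrayList<>();
        for (int i = 0; i < SPLIT_POINTS.length - 1; i++) {
          ByteKeyRange slice = ByteKeyRange.of(
              ByteKey.copyFrom(SPLIT_POINTS[i].getBytes(StandardCharsets.UTF_8)),
              ByteKey.copyFrom(SPLIT_POINTS[i + 1].getBytes(StandardCharsets.UTF_8)));
          reads.add(
              BigtableIO.read()
                  .withProjectId("my-project")    // placeholder
                  .withInstanceId("my-instance")  // placeholder
                  .withTableId("my-table")        // placeholder
                  .withKeyRange(slice));          // restricts this job to one contiguous slice
        }
        return reads;
      }
    }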

Any help will be really appreciated!


1 Answer

Answered by Ramesh Dharan:

As phrased, this question is hard to answer in a general sense. Tuning Cloud Dataflow performance when reading from Cloud Bigtable requires some knowledge of your workload and may require different numbers of Dataflow worker nodes or Bigtable server nodes.

It's possible that you hit the upper bound of scan performance and the bottleneck was the underlying storage system below the Bigtable layer, but it's hard to say without more detail.

Generally, when tuning Cloud Dataflow, we recommend investigating throttling and autoscaling, though these are typically more of an issue for ingestion workloads than for simple scans.
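As a rough illustration rather than a tuning recommendation for this specific workload, worker count and autoscaling can be set through DataflowPipelineOptions when constructing the pipeline; the worker numbers below are placeholders taken from the ranges mentioned in the question:

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ScanJobTuning {
      public static DataflowPipelineOptions build(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setWorkerMachineType("n1-standard-1");  // machine type from the question
        options.setNumWorkers(15);                      // placeholder starting worker count
        options.setMaxNumWorkers(50);                   // placeholder ceiling for autoscaling
        options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
        return options;
      }
    }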