Best approach for BigQuery data transformations


I already have terabytes of data stored in BigQuery and I want to perform heavy data transformations on it.

Considering COSTS and PERFORMANCE, what is the best approach you would suggest for performing these transformations, given that the data will keep being used in BigQuery afterwards?

I'm considering a few options:
1. Read the raw data with Dataflow, transform it, and load it back into BigQuery?
2. Do the transformations directly in BigQuery?

Any ideas about how to proceed with this?


There are 2 answers

robin (BEST ANSWER)

According to this Google Cloud documentation, the following criteria should help you choose between Dataflow and BigQuery for ETL:

Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:

  • You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result, this approach also has lower storage requirements because data is only stored in BigQuery in its joined and transformed state.

  • You plan to do custom data cleansing (which cannot be simply achieved with SQL).

  • You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.

  • You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).

  • You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.

  • You plan to add data incrementally, as opposed to performing a one-time ETL.

aga

I've written down some of the most important points about performance below; among them you'll also find considerations relevant to your question about using Dataflow.

Best practices for performance:

  • Choosing file format:

BigQuery supports a wide variety of file formats for data ingestion, and some are naturally faster than others. When optimizing for load speed, prefer the Avro file format: it is a binary, row-based format that can be split and read in parallel by multiple workers.

Loading data from compressed files, specifically CSV and JSON, is slower than loading data in other formats. The reason is that gzip compression is not splittable: BigQuery has to take the whole file, load it onto a single slot, decompress it there, and only then parallelize the rest of the load. (A short load sketch follows the list below.)

**FASTER**
Avro (compressed)
Avro (uncompressed)
Parquet / ORC
CSV
JSON
CSV (compressed)
JSON (compressed)
**SLOWER**
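
For example, a minimal load job using the google-cloud-bigquery Python client could look like this (the bucket, dataset, and table names are just placeholders):

```python
# Sketch: load Avro files from GCS into BigQuery; all names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.avro",   # hypothetical GCS path
    "my_project.my_dataset.events_raw",       # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

Since Avro files carry their own schema, you don't need to pass one explicitly for the load.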
  • ELT / ETL:

After loading data into BQ, you can think about transformations (ELT or ETL). In general, prefer ELT over ETL where possible: BQ is very scalable and can handle large transformations on a ton of data. ELT is also quite a bit simpler, because you can just write SQL queries to transform the data and move it between tables, without having to manage a separate ETL application.
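
To make the ELT idea concrete, here's a rough sketch of running a transformation entirely inside BigQuery from Python (table and column names are invented for illustration):

```python
# Sketch: ELT step that writes a cleaned table from a raw table inside BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.orders_clean` AS
SELECT
  order_id,
  LOWER(TRIM(customer_email)) AS customer_email,
  SAFE_CAST(amount AS NUMERIC) AS amount,
  DATE(created_at) AS order_date
FROM `my_project.my_dataset.orders_raw`
WHERE order_id IS NOT NULL
"""

client.query(sql).result()  # the heavy lifting happens on BigQuery's side
```

You could run the same statement from the BigQuery console or as a scheduled query; the point is that no separate ETL cluster is involved.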

  • Raw and staging tables:

Once you have started loading data into BQ, within your warehouse you generally want to use raw and staging tables before publishing to reporting tables. The raw table contains the full daily extract, or a full load of the data you're ingesting. The staging table is then essentially your change-data-capture table, so you can use queries or DML to merge that data into the staging table and keep a full history of everything that has been inserted. Finally, the reporting tables are the ones you publish out to your users.
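
As an illustration of the raw-to-staging step, a MERGE statement (again with made-up table and column names) might look like this:

```python
# Sketch: merge a daily raw extract into a staging table using DML.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.my_dataset.customers_staging` AS stg
USING `my_project.my_dataset.customers_raw` AS raw
ON stg.customer_id = raw.customer_id
WHEN MATCHED THEN
  UPDATE SET stg.email = raw.email, stg.updated_at = raw.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (raw.customer_id, raw.email, raw.updated_at)
"""

client.query(merge_sql).result()
```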

  • Speeding up pipelines using DataFlow:

When you get into streaming loads or really complex batch loads (ones that don't fit cleanly into SQL), you can leverage Dataflow or Data Fusion to speed up those pipelines and do more complex work on the data. If you're starting with streaming, I recommend the Dataflow templates Google provides for loading data from many different sources and moving it around; you can find them in the Dataflow UI under the Create Job from Template button. If a template mostly fits your use case but you want to make one slight modification, the templates are also open source, so you can go to the repo and modify the code to fit your needs.
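
If you do end up writing a custom pipeline instead of using a template, a minimal streaming Beam pipeline (Python SDK) from Pub/Sub into BigQuery could be sketched like this; the topic, table, bucket, and field names are all assumptions:

```python
# Sketch: streaming Dataflow pipeline from Pub/Sub to BigQuery (Apache Beam Python SDK).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_row(message_bytes):
    # Assumes each Pub/Sub message is a JSON object (hypothetical format).
    record = json.loads(message_bytes.decode("utf-8"))
    return {"event_id": record.get("id"), "payload": json.dumps(record)}


options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",                 # use "DirectRunner" to test locally
    project="my_project",                    # hypothetical project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",      # hypothetical bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my_project/topics/events")
        | "ToRow" >> beam.Map(to_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.events_stream",
            schema="event_id:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```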

  • Partitioning:

Partitioning in BQ physically splits your data on disk, based on ingestion time or on a column within your data, so queries can efficiently read only the parts of the table you need. This brings huge cost and performance benefits, especially on large fact tables. Whenever you have a fact or temporal table, use a partition column on your date dimension.
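
For example, a date-partitioned fact table and a query that only scans one month of partitions could look like this (schema and names are made up):

```python
# Sketch: create a date-partitioned table and run a partition-pruned query.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE TABLE IF NOT EXISTS `my_project.my_dataset.sales_fact`
(
  sale_id STRING,
  sale_date DATE,
  amount NUMERIC
)
PARTITION BY sale_date
""").result()

# Filtering on the partition column means only January's partitions are scanned.
client.query("""
SELECT SUM(amount) AS total
FROM `my_project.my_dataset.sales_fact`
WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31'
""").result()
```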

  • Cluster Frequently Accessed Fields:

Clustering allows you to physically order data within a partition, by one or multiple keys. This provides significant performance benefits when used properly.
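
Continuing the same hypothetical schema, clustering is just an extra clause on the table definition:

```python
# Sketch: partitioned table clustered by the fields you filter on most often.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE TABLE IF NOT EXISTS `my_project.my_dataset.events_fact`
(
  event_date DATE,
  customer_id STRING,
  event_type STRING,
  payload STRING
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type
""").result()
```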

  • BQ reservations:

Reservations let you create pools of slots and assign projects to those reservations, so you can allocate more or fewer resources to certain types of queries.
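
If you go down the reservations route, the google-cloud-bigquery-reservation client can create a reservation and assign a project to it. The sketch below is only an outline: the project names and slot count are made up, and it assumes you already have committed slot capacity.

```python
# Sketch: create a slot reservation and assign a project to it (names are hypothetical).
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/my_project/locations/US"

# Create a reservation with a fixed number of slots.
res = client.create_reservation(
    parent=parent,
    reservation_id="etl-reservation",
    reservation=reservation.Reservation(slot_capacity=100, ignore_idle_slots=False),
)

# Assign a project to the reservation so its queries run on those slots.
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee="projects/my_etl_project",
    ),
)
```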

Best practices for saving costs can be found in the official documentation.

I hope this helps.