Google BigQuery Optimization Strategies

Question

Google BigQuery Optimization Strategies

3.2k views Asked by Silvia Pina At 10 June 2015 at 17:28

I am querying data from Google Analytics Premium using Google BigQuery. At the moment, I have one single query which I use to calculate some metrics (like total visits or conversion rate). This query contains several nested JOIN clauses and nested SELECTs. While querying just one table I am getting the error:

Error: Resources exceeded during query execution.

Using GROUP EACH BY and JOIN EACH does not seem to solve this issue.

One solution to be adopted in the future involves extracting only the relevant data needed for this query and exporting it into a separate table (which will then be queried). This strategy works in principle, I have already a working prototype for it.

However, I would like to explore additional optimization strategies for this query that work on the original table.

In this presentation You might be paying too much for BigQuery some of them are suggested, namely:

Narrowing the scan (already doing it)
Using query cache (does not apply)

The book "Google BigQuery Analytics" mentions also adjusting query features, namely:

GROUP BY clauses generating large number of distinct groups (already did this)
Aggregation functions requiring memory proportional to the number of input values (probably does not apply)
Join operations generating a greater number of outputs than inputs (does not seem to apply)

Another alternative is just splitting this query into its composing sub-queries, but at this moment I cannot opt for this strategy.

What else can I do to optimize this query?

Original Q&A

There are 2 answers

Segah Meer On 07 July 2016 at 21:15

Google's BigQuery has a lot of quirks because it is not ANSI compatible. These quirks are also its advantages. That said, you will waste too much time writing queries against BigQuery directly. You should either use an API/SDK or a tool such as Looker that will generate SQL for you: https://looker.com/blog/big-query-launch-blog at execution time, giving you resource estimate before spending your money.

**Flame_Phoenix** · Accepted Answer · 2017-03-09T11:42:45+00:00

Why does BigQuery have errors?

BigQuery is a shared and distributed resource and as such it is expected for jobs to fail at some point in time. This is why the only solution is to retry the job with exponential backoff. As a golden rule, jobs should be retried a minimum of 5 times and as long as a job is not unable to complete for more than 15 minutes the service is within the SLA [1].

What can be the causes?

I can think off two causes for this that can be affecting your queries:

Data skewing [2]
Unoptimized queries

Data Skewing

Regarding the first situation, this happens when data is not evenly distributed. Because the inner mechanic of BigQuery uses a version of MapReduce this means if you have for example a music or video file with millions of hits, the workers doing that data aggregation will have their resources exhausted while the other workers won’t be doing much at all because the aggregations for the videos or musics they are processing have little to no hits.

If this is the case, the recommendation is to uniformly distribute your data.

Unoptimized queries

If you don’t have access to modifying the data, the only solution is to optimize the queries. Optimized queries follow these general rules:

When using a SELECT, make sure you only select strictly the columns you need as this diminishes the cardinality of the requests (avoid using SELECT * for example)
Avoid using ORDER BY clauses on large sets of data
Avoid using GROUP BY clauses as they create a barrier to parallelism
Avoid using JOINS as these are extremely heavy on the worker's memory, and may cause resource starvation and resource errors (as in not enough memory).
Avoid using Analytical functions [3]
If possible, do your queries on Partitioned tables [4].

Following any of these strategies should help your queries have less errors and improve their overall running time.

Additional

You can't really understand BigQuery unless you understand MapReduce first. For this reason I strongly recommend you have a look on Hadoop tutorials, like the one in tutorialspoint:

https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

For a similar version of BigQuery, but that is Open Source (and less optimized in every single way) you can also check Apache Hive [4]. If you understand why Apache Hive fails, you understand why BigQuery fails.

[1] https://cloud.google.com/bigquery/sla

[2] https://www.mathsisfun.com/data/skewness.html

[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions

[4] https://cloud.google.com/bigquery/docs/partitioned-tables

[5] https://en.wikipedia.org/wiki/Apache_Hive