I have a query like this.
WITH all_products AS (
SELECT
gtin,
category,
product_name,
product_image,
brand,
manufacturer
FROM `products`
),
client_products as (
SELECT *
FROM all_products
WHERE
client_id = "usdemoaccount" and is_client_product = true
),
competitor_products as (
SELECT *
FROM all_products
WHERE
client_id = "usdemoaccount" and is_client_product = false
);
Here computation for all_products happens twice because its reference twice in the below code. But we need not do it twice and save on compute if re-use the above results.
The reason for this from BQ documentation mentions, non recursive CTE's are not materialized.
BigQuery only materializes the results of recursive CTEs, but does not materialize the results of non-recursive CTEs inside the WITH clause. If a non-recursive CTE is referenced in multiple places in a query, then the CTE is executed once for each reference.
I'm exploring alternatives to address this issue. While I understand that using temporary tables is one option, I'm concerned about potential drawbacks such as increased storage costs and concurrency issues, especially when the same API is used by multiple users with different parameters.
What are some effective strategies or best practices for optimizing the performance of CTEs in BigQuery? Specifically, I'm interested in approaches that can help materialize non-recursive CTEs or improve query performance without resorting to temporary tables.
Even with temporary tables, if there is a option to automatically clean up those tables to avoid storage costs and handle concurrently out of the box, that should also be preferable.
One big factor in keeping costs down in BigQuery is to always explicitly specify the columns you want to select from, rather than using
SELECT *.Additionally, for runtime, limit the data to as few rows as possible, e.g. limit by date, and/or some other
WHEREcondition, so any later queries or CTEs only work with the smallest required number or rows.Cost vs runtime
In the query provided, there are no additional costs as nothing is materialised. The original query only from
productsonly runs once, so only 1 set of charges is incurred, even if you reference the first CTE twice later, so that isn't a cost or storage concern, only a runtime one.Reducing runtime
To reduce the runtime, you can set the flag for client or competitor in the first (and only) CTE, and then call it as necessary using the derived field
client_competitor_product. This replaces working out which CTE you want to use -client_productorcompetitor_product.Reducing data parsed
Additionally, if the condition is the same in both clauses (
client_id = "usdemoaccount"), move that to aWHEREclause so it limits the dataset it is working with before evaluating theCASEstatement.Notes on the original query from the OP
In the 2nd and 3rd CTEs, 2x referenced fields are not present in all_products, so they would fail -
client_idandis_client_productare not present inall_products.There is no output
SELECTfor the CTE, so which CTE could be referenced? Sometimes it isclient_productsand sometimescompetitor_products. If youUNION ALLthem, you still can't tell which product belongs to which source, without an additional field, e.g. source, to distinguish them.Limit the fields selected, the data evaluated and derive the field/s required