Chrome UX Report: improve query performance

261 views Asked by At

I am querying the Chrome UX Report public dataset using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time (I stopped it at 180 seconds) because I don't know what the timeout is for a query or how to tell if the query hung.

I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure if I'm using BigQuery correctly or there are ways to optimize the query to make it runs faster

This is the query I'm using.

SELECT
  _TABLE_SUFFIX AS yyyymm,
  AVG(fcp.density) AS average_fcp,
  AVG(fp.density) as average_fp,
  AVG(dcl.density) as average_dcl
FROM
  `chrome-ux-report.country_cl.*`, 
  UNNEST(first_paint.histogram.bin) as fp,
  UNNEST(dom_content_loaded.histogram.bin) as dcl,
  UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
  form_factor.name = 'desktop' AND
  fcp.start > 20000
GROUP BY
  yyyymm
ORDER BY
  yyyymm
3

There are 3 answers

0
Alessandro On

Let me start saying that BigQuery query timeout is very long (6 hours) so you should not have a problem on this front but you might encounter other errors.

We had the same issue internally, we have datasets with data divided in country tables, even though the tables are partitioned on timestamp when running queries over hounders of tables, not only the query takes a long time, but sometime it will fail with resources exceeded error.

Our solution was to aggregate all this table into one single one adding 'country' column use it as clustering column. This not only made our queries executed but it made them even faster than our temporary solution of running the same query on a sub set of the country tables as intermediate steps and then combining the results together. It is now faster, easier and cleaner.

Coming back to your specific question, I suggest to create a new table (which you will need to host $$$) that is the combination of all the tables inside a dataset as a partitioned table.

The quickest way, unfortunately also the more expensive one (you will pay for the query scan), is to use a create table statement.

create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
  date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
  *
from `chrome-ux-report.country_cl.*`
);

If this query fails you can run it on a sub set of table e.g. where starts_with(_table_suffix, '2018') and the run the following query with the 'write append' disposition against the table you create before.

select
  date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
  *
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')

If you noticed I also used a clustering column, which is think is a best practice to do.

Note for who is curating Google public datasets.

It would be nice to have a public "chrome_ux_report" dataset with a just a single table partitioned by date and clustered by country.

1
Alexandre Moraes On

After carefully reviewing your query, I concluded that the processing time for each of the actions you are performing is around 6 seconds or less. Therefore, I decided to execute each task from each unnest and then append the tables together, using UNION ALL method.

The query ran within 4 seconds. The syntax is:

SELECT
  _TABLE_SUFFIX AS yyyymm,
  AVG(fcp.density) AS average_fcp,
FROM
  `chrome-ux-report.country_cl.*`, 
  UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
  form_factor.name = 'desktop' AND
  fcp.start > 20000
GROUP BY
  yyyymm


UNION ALL 

SELECT
  _TABLE_SUFFIX AS yyyymm,
  AVG(fp.density) as average_fp,

FROM
  `chrome-ux-report.country_cl.*`, 
  UNNEST(first_paint.histogram.bin) as fp
WHERE
  form_factor.name = 'desktop' 
GROUP BY
  yyyymm

UNION ALL 

SELECT
  _TABLE_SUFFIX AS yyyymm,
  AVG(dcl.density) as average_dcl
FROM
  `chrome-ux-report.country_cl.*`, 
  UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
  form_factor.name = 'desktop' 
GROUP BY
  yyyymm
ORDER BY
  yyyymm  

In addition, I would like to point that according to the documentation it is advisable to avoid the excessive use of wildcards opting to use date ranges and materializing large datasets results. Also, I would like to point that BigQuery limits cached results to 10gb.

I hope it helps.

2
Felipe Hoffa On

I'm not sure if it makes mathematical sense to get the AVG() of all the densities - but let's do it anyways.

The bigger problem in the query is this:

  UNNEST(first_paint.histogram.bin) as fp,
  UNNEST(dom_content_loaded.histogram.bin) as dcl,
  UNNEST(first_contentful_paint.histogram.bin) AS fcp

-- that's an explosive join: It transforms one row with 3 arrays with ~500 elements each, into 125 million rows!!! That's why the query isn't running.

A similar query that gives you similar results:

SELECT yyyymm,
  AVG(average_fcp) average_fcp,
  AVG(average_fp) average_fp,
  AVG(average_dcl) average_dcl
FROM (
  SELECT
     _TABLE_SUFFIX AS yyyymm,
    (SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
    (SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
    (SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
  FROM `chrome-ux-report.country_cl.*` 
  WHERE form_factor.name = 'desktop'  
)
GROUP BY yyyymm
ORDER BY yyyymm

The good news: This query runs in 3.3 seconds.

Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?

enter image description here


Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:

SELECT yyyymm,
  AVG(average_fcp) average_fcp,
  AVG(average_fp) average_fp,
  AVG(average_dcl) average_dcl
FROM (
  SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
  FROM (
    SELECT
       _TABLE_SUFFIX AS yyyymm,
      (SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
      (SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
      (SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
      origin
    FROM `chrome-ux-report.country_cl.*`   
  )
  GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm

enter image description here