BigQuery COALESCE for synonymous structs in wildcard query

1k views Asked by At

In the table chrome-ux-report.all.201910 and earlier we had a field named experimental.first_input_delay. As of chrome-ux-report.all.201911 the same data has been renamed to first_input.delay.

Prior to this change, I've used wildcard queries like chrome-ux-report.all.* to aggregate all YYYYMM data, but now these queries are failing because the field name is different. I'm looking for a fix that can accommodate either old or new field names. Here's a simplified example:

SELECT
  COALESCE(first_input.delay.histogram.bin, experimental.first_input_delay.histogram.bin) AS fid
FROM
  `chrome-ux-report.all.*`

This results in an error that first_input_delay doesn't exist in the schema for the experimental struct:

Error: Field name first_input_delay does not exist in STRUCT<time_to_first_byte STRUCT<histogram STRUCT<bin ARRAY<STRUCT<start INT64, end INT64, density FLOAT64>>>>>` at [2:58]

Of course, the field exists in that struct for some of the tables covered by the wildcard but not others. It seems the validator only looks at the most recent table.

So my question is whether it's possible to use something like COALESCE to accommodate a field that gets renamed across tables? I understand that the schema makes this harder for us and a better solution is to use a single partitioned table but I'd like to hear if this is solvable given our current setup.

3

There are 3 answers

0
Felipe Hoffa On BEST ANSWER

Try offering your users a view - a starting point could be:

CREATE OR REPLACE VIEW `fh-bigquery.public_dump.chrome_ux_experimental_input_delay_view_202001`
AS
SELECT * EXCEPT(experimental)
  , experimental.first_input_delay.histogram.bin AS fid
  , CONCAT('2018', _table_suffix) ts
FROM `chrome-ux-report.all.2018*` 
UNION ALL
SELECT * EXCEPT(largest_contentful_paint,  experimental), experimental.first_input_delay.histogram.bin
  , CONCAT('20190', _table_suffix) ts
FROM `chrome-ux-report.all.20190*`  
UNION ALL
SELECT * EXCEPT(largest_contentful_paint,  experimental), experimental.first_input_delay.histogram.bin
  , '201910'
FROM `chrome-ux-report.all.201910`   
UNION ALL
SELECT * EXCEPT(largest_contentful_paint,  experimental, first_input, layout_instability), first_input.delay.histogram.bin
  , '201911'
FROM `chrome-ux-report.all.201911`   
UNION ALL
SELECT * EXCEPT(largest_contentful_paint,  experimental, first_input, layout_instability), first_input.delay.histogram.bin
  , '201912'
FROM `chrome-ux-report.all.201912`   

Now your users can run queries like:

SELECT ts, origin, fid
FROM `fh-bigquery.public_dump.chrome_ux_experimental_input_delay_view_202001` 
LIMIT 10

enter image description here

Ps: These tables really need to be clustered - this query would process significantly less bytes if the tables were so.

2
rmesteves On

Try the following:

SELECT
  #Use coalesce for all the fields existing in the two tables#
  COALESCE(t1.first_input.delay.histogram.bin, t2.experimental.first_input_delay.histogram.bin) AS fid
FROM
(SELECT * FROM  `tables-with-old-field`) t1 FULL OUTER JOIN
(SELECT * FROM  `tables-with-new-field`) t2
ON t1.primary_key = t2.primary_key 

Just edited the query. Please let me know if it works

4
rtenha On

The * wildcard character unions tables together, so the COALESCE will only have one or the other available to use. When you call COALESCE with both columns as arguments, it will fail.

You will want to process each schema differently and then union them.

with old_stuff as (
  -- Process the old data
  select some stuff
  from `chrome-ux-report.all.*`
  where _TABLE_SUFFIX <= '201910'
),
new_stuff as (
  -- Process the new data
  select and rename some stuff
  from `chrome-ux-report.all.*`
  where _TABLE_SUFFIX >= '201911'
),
unioned as (
  select * from old_stuff 
  union all 
  select * from new_stuff
)
select * from unioned

Select, Rename, and Cast as needed within each CTE.