Why doesn't Google BigQuery use the partition date correctly when querying through views?


I have a date-partitioned table (call it sample_table) with two columns: one storing a dateTime in UTC and the other storing a time-zone offset. I have a view on top of this table (call it sample_view). The view exposes the table's _PARTITIONTIME as a partitionDate column, and it also has a customerDateTime column that simply adds timeZoneOffset to dateTime.
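For reference, a minimal sketch of what such a view definition might look like, written in legacy SQL to match the queries below (the column names timeZoneOffset, customer, and containerName are assumed from the question, not confirmed):

```sql
-- Hypothetical definition of sample_view (names assumed from the question).
-- Exposes _PARTITIONTIME as partitionDate and shifts dateTime by the
-- per-row offset (in seconds) to get the customer-local timestamp.
SELECT
  _PARTITIONTIME AS partitionDate,
  DATE_ADD(dateTime, timeZoneOffset, 'SECOND') AS customerDateTime,
  customer,
  containerName,
  dateTime,
  timeZoneOffset
FROM
  [sample_project.sample_table]
```

Note that once _PARTITIONTIME is wrapped in a view column like this, the query engine may no longer be able to push the filter down to the partition pseudo-column, which is relevant to the scan sizes observed below.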

When I query sample_table directly, filtering only on _PARTITIONTIME, BigQuery scans far less data (131 MB):

select
  containerName,
  count(*)
from
  [sample_project.sample_table] 
where
  _partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

But when I run the same query on the table with an additional filter on the dateTime column, to restrict by the customer's local date-time, BigQuery scans more (211 MB). I expected 131 MB or less.

select
  containerName,
  count(*)
from
  [sample_project.sample_table] 
where
  _partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
  and DATE_ADD(dateTime, 3600, 'SECOND' ) between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

When I run a similar query against sample_view using partitionDate, BigQuery scans more still (399 MB):

select
  containerName,
  count(*)
from
  [sample_project.sample_view] 
where
  partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

And when I run the query against the view using both partitionDate and the customerDateTime column, BigQuery scans even more (879 MB):

select
  containerName,
  count(*)
from
  [sample_project.sample_view] 
where
  partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19') and customerDateTime between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

I'm not sure whether I'm scanning the right partitions in any of the queries above. Why do I see these differences between the queries? Is exposing _PARTITIONTIME as a new partitionDate column a bad strategy? I'm not sure how else to use the partition date within Tableau without writing more queries. Please let me know if you need more details.


There are 2 answers

Mikhail Berlyant

Just a guess: the problem you see is because you have repeated fields. Legacy and standard SQL deal differently with flattening results. Legacy SQL flattens the result, so you see not the count of original records but rather the number of repeated values within them, whereas standard SQL keeps the original structure. In legacy SQL you need to take extra care to eliminate the effect of flattening, while in standard SQL it is already taken care of.

Elliott Brossard

You will probably need to use standard SQL for the query instead, since legacy SQL has some limitations with filter pushdown. I'm not very familiar with Tableau myself, but they have a help page for BigQuery that talks about switching between legacy and standard SQL.
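As a sketch, the first query from the question rewritten in standard SQL would look like this (backtick-quoted table names replace the legacy square brackets; table and column names are taken from the question):

```sql
-- Standard SQL version of the first query; with standard SQL,
-- the _PARTITIONTIME filter is more likely to be pushed down
-- so only the matching partitions are scanned.
SELECT
  containerName,
  COUNT(*) AS row_count
FROM
  `sample_project.sample_table`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-12') AND TIMESTAMP('2016-12-19')
  AND customer = 'X'
  AND containerName = 'XXX'
GROUP BY containerName;
```

Comparing the bytes-processed estimate of this query against the legacy SQL version should show whether the extra scanning comes from the legacy dialect rather than from the view itself.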