I have a date partitioned table (call it sample_table
) with 2 columns, one to save dateTime in UTC and other to save timezone offset. I have a view on top of this table (call it sample_view
). The view takes _partitiontime
in from table and exposes that as partitionDate
column and also there is another column customerDateTime
which simply adds dateTime with timeZoneOffset.
When I query the sample_table
directly using only _partitiontime
bigquery scans far less data (131 MB).
select
containerName,
count(*)
from
[sample_project.sample_table]
where
_partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
But when I run same query on the table with dateTime
column to scan according to customer's local date time big query scans more (211MB). I expected less than 131MB or equal to 131MB.
select
containerName,
count(*)
from
[sample_project.sample_table]
where
_partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
and DATE_ADD(dateTime, 3600, 'SECOND' ) between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
When I run similar query against the sample_view
with partitionDate
bigquery scans more (399MB)
select
containerName,
count(*)
from
[sample_project.sample_view]
where
partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
And when I run query against the view with partitionDate
and use customerDateTime
column as well bigquery scans even more (879MB)
select
containerName,
count(*)
from
[sample_project.sample_view]
where
partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19') and customerDateTime between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
I'm not too sure whether I'm scanning right partitions from any of the queries above. Why do I see the differences between these queries? Is exposing _partitiontime as a new column partitionDate
a bad strategy? I'm not sure how else to use the partition date within Tableau without writing more queries. Please let me know if you require more details.
Just guess - the problem you see is because you have repeated fields. Legacy and Standard SQL deal differently with flattening result. Legacy SQL does flatten result thus you see not count of original records but rather number of repeated values in them. Whereas Standard SQL keep original structure. In Legacy SQL you need to take extra care of eliminating effect of Flattening, while in Standard SQL it is already taken care of