A quick question to GBQ gurus.
Here are two queries that are identical in their purpose
first
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[
OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
second
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The 1st one uses OFFSET
to extract data from nested fields. According to execution details report, the query requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST
to reach nested data. And the amount of shuffled bytes is around (!) 75 MB
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that according to this article which concerns optimizing communication between slots I should uses OFFSET
instead of UNNEST
to get the data stored in nested fields?
Thanks!
Let's consider following examples with using BigQuery public dataset.
UNNEST - returns 6 results:
OFFSET - returns 1 result:
Both queries are referencing to the same record inside a GA public table. They show that using a join with UNNEST will bring one row per element inside the array and using OFFSET(0) will bring only one row with the first element of the array.
The reason for difference in high data shuffling is because the UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.