Question
Say you have some simple data about some purchases:
| user_id | order_date | product_id |
|---|---|---|
| 001 | mon | 2e1 |
| 001 | mon | 44h |
| 001 | tues | e6f |
| 002 | wed | 6g3 |
| 002 | wed | 43m |
| 003 | wed | k19 |
| 003 | fri | 9d5 |
I need to aggregate the product IDs into an array column, e.g. using COLLECT_SET, grouping by user_id and order_date. However, I also want to keep the product_id column, like so:
| user_id | order_date | product_id | product_ids |
|---|---|---|---|
| 001 | mon | 2e1 | ["2e1","44h"] |
| 001 | mon | 44h | ["2e1","44h"] |
| 001 | tues | e6f | ["e6f"] |
| 002 | wed | 6g3 | ["6g3","43m"] |
| 002 | wed | 43m | ["6g3","43m"] |
| 003 | wed | k19 | ["k19"] |
| 003 | fri | 9d5 | ["9d5"] |
Problem
I can easily create the array column with the following query:
    SELECT user_id,
           order_date,
           COLLECT_SET(product_id) AS product_ids
    FROM table t
    GROUP BY user_id, order_date
But that way I don't get the product_id column for every row, which I need.
Meanwhile if I include the product_id as so:
    SELECT user_id,
           order_date,
           product_id,
           COLLECT_SET(product_id) AS product_ids
    FROM table t
    GROUP BY user_id, order_date, product_id
Then the product_ids column will always be an array of length one, i.e.:
| user_id | order_date | product_id | product_ids |
|---|---|---|---|
| 001 | mon | 2e1 | ["2e1"] |
| 001 | mon | 44h | ["44h"] |
And of course if I exclude product_id from the GROUP BY, I get an error: "Expression not in GROUP BY key 'product_id'"
Is it possible to do this in a single simple query, without e.g. creating a temp table and then joining them on user_id and order_date? Thanks!
Answer
You're not getting the expected result because product_id is part of the GROUP BY: each group then contains a single product_id, so COLLECT_SET returns a one-element array and you effectively get the original table back.
Instead, aggregate the table on user_id and order_date only, producing one row per pair together with its product_ids array. Then join the main table to this aggregated dataset on those two columns, and you get the expected result.
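As a minimal sketch of the aggregate-then-join approach described above: the table name `purchases` is a placeholder standing in for the question's table, and the aliases `t` and `agg` are illustrative; the column names come from the question.

```sql
-- Sketch only: `purchases` is a hypothetical table name, not from the original post.
SELECT t.user_id,
       t.order_date,
       t.product_id,
       agg.product_ids
FROM purchases t
JOIN (
    -- One row per (user_id, order_date) with the set of its product IDs
    SELECT user_id,
           order_date,
           COLLECT_SET(product_id) AS product_ids
    FROM purchases
    GROUP BY user_id, order_date
) agg
  ON  t.user_id    = agg.user_id
  AND t.order_date = agg.order_date;
```

Because this is an inner join on equality, rows whose user_id or order_date is NULL would be dropped, since NULL keys never compare equal. The order of elements inside each array is not guaranteed by COLLECT_SET.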
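The query joins the main table with the aggregate subquery, and the outer SELECT takes the COLLECT_SET result as the product_ids column.
The subquery would return the following dataset:
| user_id | order_date | product_ids |
|---|---|---|
| 001 | mon | ["2e1","44h"] |
| 001 | tues | ["e6f"] |
| 002 | wed | ["6g3","43m"] |
| 003 | wed | ["k19"] |
| 003 | fri | ["9d5"] |
Then the overall query's result would be:
| user_id | order_date | product_id | product_ids |
|---|---|---|---|
| 001 | mon | 2e1 | ["2e1","44h"] |
| 001 | mon | 44h | ["2e1","44h"] |
| 001 | tues | e6f | ["e6f"] |
| 002 | wed | 6g3 | ["6g3","43m"] |
| 002 | wed | 43m | ["6g3","43m"] |
| 003 | wed | k19 | ["k19"] |
| 003 | fri | 9d5 | ["9d5"] |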
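As an aside on the question's request for a single query without a join: one alternative that may be worth trying is a window function. This is a different technique from the join described in this answer, and it assumes your Hive version accepts COLLECT_SET with an OVER clause, which you should verify on your own cluster. The table name `purchases` is again a placeholder.

```sql
-- Sketch only: assumes COLLECT_SET is usable as a windowing function in your Hive version.
-- `purchases` is a hypothetical table name, not from the original post.
SELECT user_id,
       order_date,
       product_id,
       -- Every row keeps its own product_id and also sees the full set for its partition
       COLLECT_SET(product_id) OVER (PARTITION BY user_id, order_date) AS product_ids
FROM purchases;
```

There is no GROUP BY here, so no rows are collapsed: the window computes the set per (user_id, order_date) partition and attaches it to each row. As with the aggregate form, the order of elements inside the array is not guaranteed.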