I have a following table in hive
CREATE TABLE orders (
customer_id INT,
date DATE,
product_id INT
);
INSERT INTO orders VALUES
(1, '2024-02-18', 101),
(1, '2024-02-18', 102),
(1, '2024-02-19', 101),
(1, '2024-02-19', 103),
(2, '2024-02-18', 104),
(2, '2024-02-18', 105),
(2, '2024-02-19', 101),
(2, '2024-02-19', 106);
And I am trying to get the following result view
date | partition_set |
|------------|-----------------|
| 2024-02-18 | [101] |
| 2024-02-18 | [101, 102] |
| 2024-02-18 | [102] |
| 2024-02-18 | [104] |
| 2024-02-18 | [104, 105] |
| 2024-02-18 | [105] |
| 2024-02-19 | [101] |
| 2024-02-19 | [101, 103] |
| 2024-02-19 | [103] |
| 2024-02-19 | [106] |
| 2024-02-19 | [101, 106] |
I have tried the following query -
select `date`, customer_id , COLLECT_SET(TRIM(upper(product_id)))
over(PARTITION by `date`, customer_id ORDER by customer_id,upper(product_id)) prd
from reporting.tmp_orders;
And I get the following result -
date | customer_id | prd |
|-----|-----|----|
|2024-02-18 | 1 |["101"]|
|2024-02-18 | 1 | ["101","102"]
| 2024-02-18 | 2 | ["104"]
| 2024-02-19 | 2 | ["101"]
| 2024-02-19 | 2 | ["101","106"]
| 2024-02-18 | 2 | ["104","105"]
| 2024-02-19 | 1 | ["101"]
| 2024-02-19 | 1 | ["101","103"]
In the above result table I should be getting 3 rows corresponding to 2024-02-18 per my desired result table for customer 1, i.e. rows corresponding where prd has values [101], [101,102] & [102]. However I am missing [102]
Where I am going wrong. I have tried other options too but the above is the closed I got to & hence sharing it here. This is easier to achieve in python but I need this to be built in Hive.
as the partitions are based on id and date and there are only 2 records for the combination the result is as expected. Window is defined on the combination of customer_id and date