Hive collect_set query

86 views Asked by At

I have a following table in hive

    CREATE TABLE orders (
    customer_id INT,
    date DATE,
    product_id INT
);
 
INSERT INTO orders VALUES
    (1, '2024-02-18', 101),
    (1, '2024-02-18', 102),
    (1, '2024-02-19', 101),
    (1, '2024-02-19', 103),
    (2, '2024-02-18', 104),
    (2, '2024-02-18', 105),
    (2, '2024-02-19', 101),
    (2, '2024-02-19', 106);

And I am trying to get the following result view

date       | partition_set   |
|------------|-----------------|
| 2024-02-18 | [101]           |
| 2024-02-18 | [101, 102]      |
| 2024-02-18 | [102]           |
| 2024-02-18 | [104]           |
| 2024-02-18 | [104, 105]      |
| 2024-02-18 | [105]           |
| 2024-02-19 | [101]           |
| 2024-02-19 | [101, 103]      |
| 2024-02-19 | [103]           |
| 2024-02-19 | [106]           |
| 2024-02-19 | [101, 106]      |


I have tried the following query - 

 

select `date`, customer_id , COLLECT_SET(TRIM(upper(product_id)))  
     over(PARTITION by  `date`, customer_id ORDER by customer_id,upper(product_id)) prd
     from reporting.tmp_orders;

And I get the following result - 

date    | customer_id   | prd |
|-----|-----|----|
|2024-02-18 | 1 |["101"]|
|2024-02-18 |   1   | ["101","102"]
| 2024-02-18 |  2   | ["104"]
| 2024-02-19 |  2   | ["101"]
| 2024-02-19 |  2   | ["101","106"]
| 2024-02-18 |  2   | ["104","105"]
| 2024-02-19 |  1   | ["101"]
| 2024-02-19 |  1   | ["101","103"]

In the above result table I should be getting 3 rows corresponding to 2024-02-18 per my desired result table for customer 1, i.e. rows corresponding where prd has values [101], [101,102] & [102]. However I am missing [102]

Where I am going wrong. I have tried other options too but the above is the closed I got to & hence sharing it here. This is easier to achieve in python but I need this to be built in Hive.

1

There are 1 answers

0
7ATOM7 On

as the partitions are based on id and date and there are only 2 records for the combination the result is as expected. Window is defined on the combination of customer_id and date