Hive aggregate function in OVER clause

661 views Asked by At

I'm working with Hive 1.2.0 and so cannot use aggregates functions in over clause like :

SELECT rank() OVER (ORDER BY sum(b))

Because it is only available starting hive 2.1.

I'm stuck trying to resolve my problem without this functionality.

For example I have a dataset like this where every line represent a purchase from client_id in store_id :

| Date       |CLIENT_ID| STORE_ID |
| 2017-01-01 |       1 |       S1 |
| 2017-01-02 |       1 |       S2 |
| 2017-01-03 |       1 |       S2 |
| 2017-01-04 |       1 |       S3 |

I would like to get for each date, the homestore of every client, which is the store where he has purchased the most during the last year. If 2 stores have the same values, we take the most recent.

The result for this example would be :

| Date       |CLIENT_ID| STORE_ID | HOMESTORE |
| 2017-01-01 |       1 |       S1 |        S1 | -- S1 Because there is no other store before
| 2017-01-02 |       1 |       S2 |        S2 | -- 1 purchase in S1 and S2 but S2 is the most recent
| 2017-01-03 |       1 |       S2 |        S2 | -- 2 purchases in S2
| 2017-01-04 |       1 |       S3 |        S2 | -- 2 purchases in S2 vs 1 in S3

With an aggregate function in the over clause, it could be solved with something like this (Althought the 1 year clause is still missing) :

SELECT 
LAST_VALUE(STORE_ID) OVER(
    PARTITION BY CLIENT_ID
    ORDER BY COUNT(STORE_ID) ASC
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as homestore

Do you have any idea how to resolve this ?

0

There are 0 answers