I'm working with Hive 1.2.0 and so cannot use aggregates functions in over clause like :
SELECT rank() OVER (ORDER BY sum(b))
Because it is only available starting hive 2.1.
I'm stuck trying to resolve my problem without this functionality.
For example I have a dataset like this where every line represent a purchase from client_id in store_id :
| Date |CLIENT_ID| STORE_ID |
| 2017-01-01 | 1 | S1 |
| 2017-01-02 | 1 | S2 |
| 2017-01-03 | 1 | S2 |
| 2017-01-04 | 1 | S3 |
I would like to get for each date, the homestore of every client, which is the store where he has purchased the most during the last year. If 2 stores have the same values, we take the most recent.
The result for this example would be :
| Date |CLIENT_ID| STORE_ID | HOMESTORE |
| 2017-01-01 | 1 | S1 | S1 | -- S1 Because there is no other store before
| 2017-01-02 | 1 | S2 | S2 | -- 1 purchase in S1 and S2 but S2 is the most recent
| 2017-01-03 | 1 | S2 | S2 | -- 2 purchases in S2
| 2017-01-04 | 1 | S3 | S2 | -- 2 purchases in S2 vs 1 in S3
With an aggregate function in the over clause, it could be solved with something like this (Althought the 1 year clause is still missing) :
SELECT
LAST_VALUE(STORE_ID) OVER(
PARTITION BY CLIENT_ID
ORDER BY COUNT(STORE_ID) ASC
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as homestore
Do you have any idea how to resolve this ?