Sum over a date range per group in Hive

6.5k views Asked by At

I have a table with certain dates and product types

Product  Process_Date Prod_Count
A         2015-08-01    2
A         2015-09-03    2
A         2015-05-06    3
B         2014-01-01    4
B         2014-12-31    5

I am trying to compute for each row, the count of product entries in the last 6 months from the process date (including the process_date). So the output should look like

Product  Process_Date  Count
A         2015-08-01    5
A         2015-09-03    7
A         2015-05-06    3
B         2014-01-01    4
B         2014-12-31    5

Currently I am doing a self-join

SELECT
q.Product, q.Process_Date, SUM(Prod_Count) AS num_180days
FROM tableT p
INNER JOIN (
  SELECT DISTINCT Product, Process_Date
  FROM tableT
) q
ON p.Product = q.Product
WHERE Process_Date BETWEEN DATE_SUB(q.Process_Date, 180) AND q.Process_Date
GROUP BY q.Product, q.Process_Date; 

This query takes really long since the table is pretty big. I was wondering if there is an alternative using windowed functions to do this.

Thanks!

1

There are 1 answers

0
BushMinusZero On

This can be accomplished simply by using Hive windows analytic functions although it's not obvious that this is true given the state of the documentation. I find that the official Hive documentation is out-of-date and often incorrect. I found this treasure by reading through the Hive JIRAs. For more information on additional window analytic function capabilities here is the original JIRI: https://issues.apache.org/jira/browse/HIVE-4112.

There is a strange restriction that the window analytic function can't contain multiplication of integers. The work around is to pre-multiply hence 15552000 = 60*60*24*180 seconds or 180 days.

SELECT
    product,
    process_date,
    prod_count,
    sum(prod_count) OVER (
        PARTITION BY product 
        ORDER BY unix_timestamp(process_date,'yyyy-MM-dd') ASC 
        RANGE BETWEEN 15552000 PRECEDING and CURRENT ROW) as count
FROM tableT

I hope this helps!