I have a table with certain dates and product types
Product Process_Date Prod_Count
A 2015-08-01 2
A 2015-09-03 2
A 2015-05-06 3
B 2014-01-01 4
B 2014-12-31 5
I am trying to compute for each row, the count of product entries in the last 6 months from the process date (including the process_date). So the output should look like
Product Process_Date Count
A 2015-08-01 5
A 2015-09-03 7
A 2015-05-06 3
B 2014-01-01 4
B 2014-12-31 5
Currently I am doing a self-join
SELECT
q.Product, q.Process_Date, SUM(Prod_Count) AS num_180days
FROM tableT p
INNER JOIN (
SELECT DISTINCT Product, Process_Date
FROM tableT
) q
ON p.Product = q.Product
WHERE Process_Date BETWEEN DATE_SUB(q.Process_Date, 180) AND q.Process_Date
GROUP BY q.Product, q.Process_Date;
This query takes really long since the table is pretty big. I was wondering if there is an alternative using windowed functions to do this.
Thanks!
This can be accomplished simply by using Hive windows analytic functions although it's not obvious that this is true given the state of the documentation. I find that the official Hive documentation is out-of-date and often incorrect. I found this treasure by reading through the Hive JIRAs. For more information on additional window analytic function capabilities here is the original JIRI: https://issues.apache.org/jira/browse/HIVE-4112.
There is a strange restriction that the window analytic function can't contain multiplication of integers. The work around is to pre-multiply hence 15552000 = 60*60*24*180 seconds or 180 days.
I hope this helps!