Query Taking Ages to Run in Hive, any way to simplify it

Question

Query Taking Ages to Run in Hive, any way to simplify it

283 views Asked by mowen10 At 19 December 2016 at 13:30

I was looking for some help in regards to an SQL query that I am trying to run in Hive (Hue Platform) and was looking for some advice to see if the query could be shortened or written differently to make it run quicker. When I have ran this (see below) against a dummy data set (1500 rows) in SQL Developer this runs farily quickly and gives me what I need:

SELECT 
    P_AGENT,
    TRUNC(P_DATE) AS P_DATE,
    SUM(TOUCH_COUNT) AS TOUCH_COUNT,   
    MIN(P_DATE) AS START_TIME,
    MAX(P_DATE) AS END_TIME,
    MAX(P_DATE) - MIN(P_DATE) AS TIME_TAKEN
FROM 
    (
    SELECT 
        T1.*,
        SUM(COUNT1) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) AS GRP,
        CASE  
            WHEN P_DATE - LAG(P_DATE) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) <= 1 / 48  
            THEN NULL
            ELSE 1
         END AS TOUCH_COUNT
    FROM 
        (
          SELECT 
            T1.*,
            (CASE WHEN P_DATE - LAG(P_DATE) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) <= 1/48  
             THEN 0 ELSE 1
            END) AS COUNT1
          FROM 
            TABLE2 T1
        ) T1
    ) T1
GROUP BY
    P_AGENT, TRUNC(P_DATE), GRP
ORDER BY
    P_DATE
    ;

Unfortunately when I try to run the same query in Hive, the query takes ages to run and the max I have ran it for is 7 hours and I still dont have an output. The problem is that I am trying to run this against a table that has over 300 million rows hence the length of time it is taking.

I decided to break up the queries and put in the initial subquery into a table (dummy data shown below) and this was fine and the output I now have in a sub table with 347million rows (dummy data below only shows 10 rows)

CREATE TABLE

CREATE TABLE "TIME_ISSUES" 
   (    
    "P_DATE" DATE, 
    "P_AGENT" VARCHAR2(8 BYTE), 
    "COUNT1" NUMBER(5,0)
   )

INSERT STATEMENTS

Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('05-APR-16 20:20:12','DD-MON-RR HH24:MI:SS'),'CLQRC0',1); 
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('07-APR-16 15:06:09','DD-MON-RR HH24:MI:SS'),'SMCXF3',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('08-APR-16 04:33:00','DD-MON-RR HH24:MI:SS'),'EAQGH1',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('08-APR-16 12:17:53','DD-MON-RR HH24:MI:SS'),'JMENJDS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('09-APR-16 13:06:53','DD-MON-RR HH24:MI:SS'),'JMENJDS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('11-APR-16 10:41:00','DD-MON-RR HH24:MI:SS'),'NUKXY3',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('12-APR-16 10:15:21','DD-MON-RR HH24:MI:SS'),'JMRJADS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('12-APR-16 11:19:23','DD-MON-RR HH24:MI:SS'),'CLMXB1',1); 
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('15-APR-16 08:36:00','DD-MON-RR HH24:MI:SS'),'EMA0L1',1);  
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('22-APR-16 16:22:00','DD-MON-RR HH24:MI:SS'),'EADBM1',1);

Unfortunately the initial table has 49 different columns but I have selected the 2 columns that I am using in the example above.

I have found that when running the second sub query, it looks the problem lies in the statement:

SUM(COUNT1) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) AS GRP,

The full query that I am running is shown here:

SELECT T1.*,
  SUM(COUNT1) OVER (PARTITION BY P_AGENT, TO_DATE(P_DATE) ORDER BY P_DATE) AS GRP,
  CASE
    WHEN P_DATE - LAG(P_DATE) over (PARTITION BY P_RECR, TO_DATE(P_DATE) ORDER BY P_DATE) <= 1 / 48
    THEN NULL
    ELSE 1
  END AS TOUCH_COUNT
FROM TIME_ISSUES T1 ;

but this takes ages to run and so I wanted to some advice to see if there is anything different that i could try as i am stuck on how to resolve this now.

The output that I am looking for will show all work that an agent has completed under a 30 minute timeframe.

An example of this from my data set is:

P_AGENT | P_DATE | TOUCH_COUNT | START_TIME | END _TIME | TIME_TAKEN
JMRJADS | 12-APR-16 | 1 | 12-APR-16 10:15:21 | 12-APR-16 10:15:21 | 0

Original Q&A

There are 1 answers

**David דודו Markovitz** · Answer 1 · 2016-12-19T13:44:09+00:00

David דודו Markovitz On 19 December 2016 at 13:44

No need for TOUCH_COUNT when you already calculated COUNT1 at the previous stage

TechQA.

Query Taking Ages to Run in Hive, any way to simplify it

There are 1 answers

Related Questions in SQL

Related Questions in HIVE

Related Questions in HUE

Popular Questions

Trending Questions