TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

Question

TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

136 views Asked by Bumhwan Kim At 15 December 2020 at 08:29

I'm going through TPC-DS for Amazon Athena.

It was fine until query 5.

I got some problem on query 6. (which is below)

select  a.ca_state state, count(*) cnt
 from customer_address a
     ,customer c
     ,store_sales s
     ,date_dim d
     ,item i
 where       a.ca_address_sk = c.c_current_addr_sk
    and c.c_customer_sk = s.ss_customer_sk
    and s.ss_sold_date_sk = d.d_date_sk
    and s.ss_item_sk = i.i_item_sk
    and d.d_month_seq = 
         (select distinct (d_month_seq)
          from date_dim
               where d_year = 2002
            and d_moy = 3 )
    and i.i_current_price > 1.2 * 
             (select avg(j.i_current_price) 
         from item j 
         where j.i_category = i.i_category)
 group by a.ca_state
 having count(*) >= 10
 order by cnt, a.ca_state 
 limit 100;

It took more than 30 minutes so it failed with timeout.

I tried to find which part cause problem, so I checked the where conditions and I found where j.i_category = i.i_category for the last part of where condition.

I don't know why this condition is needed so I deleted this part and the query ran Ok.

can you guys tell me why this part is needed?

Original Q&A

There are 1 answers

**Piotr Findeisen** · Answer 1 · 2020-12-15T13:24:05+00:00

The j.i_category = i.i_category is subquery correlation condition. If you remove it from the subquery

select avg(j.i_current_price) 
from item j 
where j.i_category = i.i_category)

the subquery becomes uncorrelated, and becomes a global aggregation on the item table, which is easy to calculate and the query engine needs to do it once.

If you want a fast, performant query engine on AWS, i can recommend Starburst Presto (disclaimer: i am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).

If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).

TechQA.

TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

There are 1 answers

Related Questions in SQL

Related Questions in AMAZON-ATHENA

Related Questions in TPC

Popular Questions

Popular Tags

Trending Questions