I'm going through TPC-DS for Amazon Athena.
It was fine until query 5.
I got some problem on query 6. (which is below)
select a.ca_state state, count(*) cnt
from customer_address a
,customer c
,store_sales s
,date_dim d
,item i
where a.ca_address_sk = c.c_current_addr_sk
and c.c_customer_sk = s.ss_customer_sk
and s.ss_sold_date_sk = d.d_date_sk
and s.ss_item_sk = i.i_item_sk
and d.d_month_seq =
(select distinct (d_month_seq)
from date_dim
where d_year = 2002
and d_moy = 3 )
and i.i_current_price > 1.2 *
(select avg(j.i_current_price)
from item j
where j.i_category = i.i_category)
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
It took more than 30 minutes so it failed with timeout.
I tried to find which part cause problem, so I checked the where conditions and I found where j.i_category = i.i_category
for the last part of where condition.
I don't know why this condition is needed so I deleted this part and the query ran Ok.
can you guys tell me why this part is needed?
The
j.i_category = i.i_category
is subquery correlation condition. If you remove it from the subquerythe subquery becomes uncorrelated, and becomes a global aggregation on the
item
table, which is easy to calculate and the query engine needs to do it once.If you want a fast, performant query engine on AWS, i can recommend Starburst Presto (disclaimer: i am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).
If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).