Impala Resource Estimation for queries with Group by

800 views Asked by At

I noticed that Impala "Estimated Per-Host Requirements" grow potentially when my queries use a "group by" with several fields. I suppose it calculates the maximum resouces needed for a join:

EXPLAIN select field1, field2
from mytable  where field1=123
group by field1, field2
order by field1, field2
limit 100;

I would like to know if there is a way to reduce the estimated value by Impala, because the real needed resources were far lower (300 MB) than the amount estimated (300 GB).

It is important to say that "field1" and "field2" are String.

1

There are 1 answers

1
Matt On

Unfortunately it is difficult to estimate the required memory based on information known at query planning time which are based on limited statistics that are available, especially when dealing with aggregations and joins that depend on the selectivity of the grouping/join exprs.

Firstly, are you sure you have up-to-date statistics on the table(s) you're using? Run COMPUTE STATS [table] to do so.

If you still have this issue with the correct stats, you can set the set mem_limit=XM query option to tell Impala that the query shouldn't use more than X MB of memory so it will request that amount of memory from Llama rather than the estimate from planning. If you're sure the query doesn't use more than 300MB, you can issue set mem_limit=300M; and then issue your query. If you're running other queries after from the same session, then clear the query option afterwards.