Is there a faster alternative to "group by" aggregation in Netezza?

1.2k views Asked by At

This the minimal query statement I want to execute.

    select count(*) from temper_300_1 group by onegid;

I do have "where" clauses to go along as well though. What I am trying to do is build a histogram query and determine the number of elements with a particular "onegid". the query takes about 7 seconds on 800 million rows. Could someone suggest a faster alternative or optimization.

I was to actually trying plot a heatmap from a spatial data consisting of latitudes and longitudes, I have assigned a grid id to each elements, but the "group by aggregation" is coming out to be pretty costly in terms of time.

2

There are 2 answers

0
Jeremy Fortune On

You're not going to get much faster than group by, though your current query won't display which group item is associated with each count.

Make sure that the table is properly distributed with

select datasliceid, count(1) from temper_300_1 group by onegid;

The counts should be roughly equal. If they're not, your DBA needs to redistribute the table on a better distribution key.

If it is, you could ask your DBA to create a materialized view on that specific column, ordered by that column. You may see some performance gains.

0
ScottMcG On

I would say that there are two primary considerations for performance related to your query: distribution and row size/extent density.

Distribution:

As @jeremytwfortune mentions, it is important that your data be well distributed with little skew. In an MPP system such as Netezza, you are only as fast as your slowest data slice, and if one data slice has 10x the data as the rest it will likely drag your performance down.

The other distribution consideration is that if your table is not already distributed on onegid, it will be dynamically redistributed on onegid when the query runs in support of your GROUP BY onegid clause. This will happen for GROUP BYs and windowed aggregates with PARTITION BYs. If the distribution of onegid values is not relatively even you may be faced with processing skew.

If your table is already distributed on onegid and you don't supply any other WHERE predicates then you are probably already optimally configured from that standpoint.

Row Size / Extent Density

As Netezza reads data to support your query each data slice will read the its disk in 3 MB extents. If your row is substantially wider than just the onegid value, you will be reading more data from the disk than you need in order to answer your query. If your table is large, your rows are wider than just onegid, and query time performance is paramount, then you might consider creating a materialized view, like so:

CREATE MATERIALIZED VIEW temper_300_1_mv AS select onegid from temper_300_1 ORDER BY onegid;

When you execute your query against temper_300_1 with only onegid in the SELECT clause, the optimizer will refer to the materialized view only, which will be able to pack more rows into a given 3MB extent. This can be a significant performance boost.

The ORDER BY clause in the MVIEW creation statement will also likely increase the effectiveness of compression of the MVIEW, further reducing the number of extents required to hold a given number of rows, and further improving performance.