Say you want to optimize a query in a postgres database like:
SELECT DISTINCT ON (first)
    first,
    second,
    third
FROM my_table
WHERE
    second > 100
    AND fourth = 3
ORDER BY
    first,
    second DESC,
    third DESC
(EDIT: In this example, let's say fourth=3 is about 25% of rows, and second > 100 is only around 5%)
Where you want to select one row per distinct first value from a table, based on a couple of filter conditions and ordered by three other columns. As far as I know, the best way to do this would be to create an index on first and second, then an index on first, second DESC, third DESC. Unfortunately, the second index doesn't seem to be used when I analyze the query.
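Roughly, I mean something along these lines (index names made up for illustration; my actual definitions may differ slightly):

    CREATE INDEX ix_my_table_filter ON my_table (first, second);
    CREATE INDEX ix_my_table_sort ON my_table (first, second DESC, third DESC);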
Is this the ideal way to create these indexes, or could there be a single index that unifies both the filtering and the sorting?
Secondly, I'm wondering: is there a way to ensure you're picking the best index strategy for a given table, or could this be determined analytically from your dataset and query?
When I run this now, this is the output from EXPLAIN (ANALYZE):
Unique  (cost=207985.91..208536.43 rows=18682 width=78) (actual time=823.330..965.769 rows=5248 loops=1)
  ->  Sort  (cost=207985.91..208261.17 rows=110104 width=78) (actual time=823.328..935.933 rows=348232 loops=1)
        Sort Key: first, second DESC, third DESC
        Sort Method: external merge  Disk: 31176kB
        ->  Index Scan using ix_my_table_second_fourth on my_table  (cost=0.44..193872.52 rows=110104 width=78) (actual time=0.017..103.031 rows=348232 loops=1)
              Index Cond: ((fourth = 3) AND (second > 100))
Planning Time: 0.315 ms
Execution Time: 971.174 ms
So you can see it uses the ix_my_table_second_fourth index to filter, but a significant majority of the time is spent sorting, so that the row with the highest second and third values for each first value can be picked.
All guesswork, based on incomplete information.
Server configuration
You currently suffer from insufficient work_mem, as indicated by the mention of 'Disk' in your query plan. Increase the setting (at least locally for the big query) by at least 32 MB - until 'Disk' goes away.
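For instance, a minimal sketch of a per-transaction bump (the 64MB figure is just an assumption; raise it until the external merge sort no longer spills to disk):

    BEGIN;
    SET LOCAL work_mem = '64MB';   -- assumption: enough headroom for the ~31MB on-disk sort
    -- ... run the big query here ...
    COMMIT;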
Then again, you probably won't need more work_mem for the following solution.

Generally, insufficient work_mem indicates one of two things: insufficient RAM and/or sub-optimal DB design, or potential for server configuration.

Your row estimates are off by a factor of 3. More aggressive autovacuum settings might help. And there is no index-only scan in your plan, so maybe you don't tap into the full potential of a covering index. Run a manual VACUUM ANALYZE tbl, and check the effect on your query before doing anything else.

Generic index for the generic query
For only a few duplicate values on first in the selection, DISTINCT ON is probably the best query technique, and it should be most efficient to optimize the index for row selection rather than pre-sorting (see the sketch after this paragraph). second > 100 is more selective than fourth = 3, but that doesn't matter for the order of index columns as long as both are in the lead. The deciding factor: equality comes before range predicates. The columns first and third do not contribute to filtering, so those may as well move into the INCLUDE section to make the index a bit smaller.
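A minimal sketch of such a covering index (the name is made up; INCLUDE requires Postgres 11 or later):

    CREATE INDEX ix_my_table_fourth_second ON my_table (fourth, second)
        INCLUDE (first, third);   -- equality column first, then the range column; first/third are payload only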
Optimized index
Assuming from your comments that the filter fourth = 3 is constant, and the lower bound in second > 100 is steadily increasing. (A timestamp, really, filtered on the current date.) So consider a partial index along these lines (sketched below). Now we are aiming at pre-sorted data.
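A sketch, using the current bounds from the question (the index name is a placeholder):

    CREATE INDEX ix_my_table_partial ON my_table (first, second DESC, third DESC)
    WHERE  fourth = 3
    AND    second > 100;   -- key columns match the ORDER BY; both filters move into the index predicate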
You didn't disclose the selectivity of the combined filters. The higher, the better for a "partial" approach.
With an increasing bound (second > 101 etc.) the index keeps being applicable, but more and more rows have to be filtered over time. So recreate the index with an increased bound from time to time. (Like once a month?) This can be automated in a cron job or similar. Use CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY to minimize friction with concurrent write access (if needed); see the sketch below.

Since we are going to use a partial index anyway, adding the second filter on a moving target helps, even if the index is never recreated.
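A sketch of such a periodic recreation, assuming the bound has meanwhile moved to 200 (names and value are placeholders):

    -- build the replacement first, without blocking concurrent writes
    CREATE INDEX CONCURRENTLY ix_my_table_partial_new ON my_table (first, second DESC, third DESC)
    WHERE  fourth = 3
    AND    second > 200;

    -- then retire the old index and take over its name
    DROP INDEX CONCURRENTLY ix_my_table_partial;
    ALTER INDEX ix_my_table_partial_new RENAME TO ix_my_table_partial;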
Only being used in the WHERE clause, fourth can now be removed from the index expressions, and index-only scans are still supported. This makes the index smaller (probably; again, missing information), which helps overall - especially with a work_mem bottleneck.

Optimized query
Your original query should already benefit. Notably, if the table is vacuumed enough, you get index-only scans. But to exploit the full potential of the partial index, emulate an index skip scan (sketched below).
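A sketch of such an emulation with a recursive CTE - a standard loose-index-scan technique; the exact query below is my assumption, built only from the columns in the question:

    WITH RECURSIVE cte AS (
       (  -- fetch the top row for the smallest qualifying "first"
       SELECT first, second, third
       FROM   my_table
       WHERE  fourth = 3
       AND    second > 100
       ORDER  BY first, second DESC, third DESC
       LIMIT  1
       )
       UNION ALL
       SELECT u.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT first, second, third
          FROM   my_table
          WHERE  first > c.first      -- skip ahead to the next "first"
          AND    fourth = 3
          AND    second > 100
          ORDER  BY first, second DESC, third DESC
          LIMIT  1
          ) u
       )
    SELECT * FROM cte;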
Cardinalities in your query plan indicate 66 dupes on avg. per first value in the selection (rows=348232 → rows=5248, i.e. 348232 / 5248 ≈ 66). Enough to make this approach worth testing.

In combination with my partial index, you should see index (or index-only) scans, and no sort step in the query plan at all. It should out-perform the rest.