PostgreSQL not using index on a filtered multiple sort query

Question

PostgreSQL not using index on a filtered multiple sort query

5.6k views Asked by Mike Fairhurst At 22 June 2015 at 18:58

I have a pretty simple table

CREATE TABLE approved_posts (
  project_id INTEGER,
  feed_id INTEGER,
  post_id INTEGER,
  approved_time TIMESTAMP NOT NULL,
  post_time TIMESTAMP NOT NULL,
  PRIMARY KEY (project_id, feed_id, post_id)
)

And I'm trying to optimize this query:

SELECT *
FROM approved_posts
WHERE feed_id IN (?, ?, ?)
AND project_id = ?
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;

The query optimizer is fetching every single approved_post that matches the predicate, sorting all 100k results, and returning the top one it finds.

I do have an index on project_id, feed_id, approved_time, post_time, which it will use if I either:
A. remove the sort by post_time, or
B. replace the IN (?, ?, ?) with a single = ?.
Then it simply does a reverse index scan to get the first result and its blazingly fast.

Option A:

 Limit  (cost=0.43..6.57 rows=1 width=24) (actual time=0.101..0.101 rows=1 loops=1)
   ->  Index Scan Backward using approved_posts_approved_time_idx on approved_posts p  (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
     Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
     Rows Removed by Filter: 37
 Total runtime: 0.129 ms

Option B:

Limit  (cost=0.43..3.31 rows=1 width=24) (actual time=0.065..0.065 rows=1 loops=1)
   ->  Index Scan Backward using approved_posts_full_pagination_index on approved_posts p  (cost=0.43..126884.70 rows=44049 width=24) (actual time=0.063..0.063 rows=1 loops=1)
     Index Cond: ((project_id = 148772) AND (feed_id = 73321))
 Total runtime: 0.092 ms

But without these tweaks it is not so performant ...

Limit  (cost=169792.16..169792.17 rows=1 width=24) (actual time=510.225..510.225 rows=1 loops=1)
   ->  Sort  (cost=169792.16..170118.06 rows=130357 width=24) (actual time=510.224..510.224 rows=1 loops=1)
     Sort Key: approved_time, post_time
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Bitmap Heap Scan on approved_posts p  (cost=12324.41..169140.38 rows=130357 width=24) (actual time=362.210..469.387 rows=126260 loops=1)
           Recheck Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
           ->  Bitmap Index Scan on approved_posts_feed_id_idx  (cost=0.00..12291.82 rows=130357 width=0) (actual time=354.496..354.496 rows=126260 loops=1)
                 Index Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Total runtime: 510.265 ms

I can even add a conditional index on these 5 feed ids and it will once again do the right thing.

My current best solution is to put every feed_id in its own query and do a massive UNION between them all. But this doesn't scale very well as I might want to select the top 500 from 30 feeds, pulling in 15k rows and sorting them for no good reason. Also managing offsets with this strategy is somewhat complex.

Does anybody know how I can do this IN clause with two sorts on my well-indexed data and get Postgres to do the right thing?

I'm using Postgres 9.3.3. Here are my indexes:

 "approved_posts_project_id_feed_id_post_id_key" UNIQUE CONSTRAINT, btree (project_id, feed_id, post_id)
 "approved_posts_approved_time_idx" btree (approved_time)
 "approved_posts_feed_id_idx" btree (feed_id)
 "approved_posts_full_pagination_index" btree (project_id, feed_id, approved_time, post_time)
 "approved_posts_post_id_idx" btree (post_id)
 "approved_posts_post_time_idx" btree (post_time)
 "approved_posts_project_id_idx" btree (project_id)

None of the columns are nullable.

This table has 2m rows, split among 200 feed IDs and 19 project IDs.

These are the most common feed IDs:

 feed_id | count  
---------+--------
   73607 | 558860
   73837 | 354018
   73832 | 220285
   73836 | 172664
   73321 | 118695
   73819 |  95999
   73821 |  75871
   73056 |  65779
   73070 |  54655
   73827 |  43710
   73079 |  36700
   73574 |  36111
   73055 |  25682
   73072 |  22596
   73589 |  19856
   73953 |  15286
   73159 |  13059
   73839 |   8925

In terms of min/max/avg cardinality per feedid/projectid pairing, we have:

 min |  max   |          avg          
-----+--------+-----------------------
   1 | 559021 | 9427.9140271493212670

Original Q&A

There are 2 answers

CoryatJohn On 22 June 2015 at 19:17

From what I understand, if the first "where" isn't the first part of the key, the key will not be used. Try switching the order of your "where's" in your query to project_id and feed_id.

**Erwin Brandstetter** · Accepted Answer · 2015-06-23T00:00:26+00:00

With a list of possible values for feed_id, Postgres has a hard time to find the best query plan. Each feed_id could be associated with 1 - 559021 rows (according to your numbers). Postgres is not currently smart enough to see the potential optimization for the special case of LIMIT 1 on its own. A UNION ALL (not just UNION) of several queries with one feed_id and LIMIT 1 each, plus another outer LIMIT 1 (like you seem to have tried) demonstrates the potential, but requires sophisticated query concatenation for a variable number of input values.

There is another way to convince the query planner it can use index scans to pick the first row from the index for each feed_id: rewrite your query with a LATERAL join:

SELECT a.*
FROM   (VALUES (?), (?), (?)) AS t(feed_id)
     , LATERAL (
   SELECT *
   FROM   approved_posts
   WHERE  project_id = ?
   AND    feed_id = t.feed_id
   ORDER  BY approved_time DESC, post_time DESC
   LIMIT  1
   ) a
ORDER  BY approved_time DESC, post_time DESC
LIMIT  1;

Or, more convenient for a variable number of values for feed_id:

SELECT a.*
FROM   unnest(?) AS t(feed_id)  -- provide int[] var
     , LATERAL ( ...

Pass an integer array for the variable, like '{123, 234, 345}'::int[]. This could also be implemented elegantly with a function using a VARIADIC parameter. Then you can pass a list of integer values:

Pass multiple values in single parameter

Your index on (project_id, feed_id, approved_time, post_time) works for this since Postgres can scan indexes backwards almost as fast as forwards, but (project_id, feed_id, approved_time DESC, post_time DESC) would be even better. See:

Optimizing queries on a range of timestamps (two columns)

If you don't need to return all columns of the table, even index-only scans might be an option.

Your columns approved_time, post_time are defined NOT NULL. Else, you have to do more:

Unused index in range of dates query

Related answer detailing the LATERAL join technique:

Optimize GROUP BY query to retrieve latest record per user

Why did your option A work?

A closer look reveals two things:

->  Index Scan Backward using approved_posts_approved_time_idx
    on approved_posts p (cost=0.43..840483.02 rows=136940 width=24)
                        (actual time=0.100..0.100 rows=1 loops=1)
     Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))

Bold emphasis mine.

A different, smaller index on just (approved_time) is used.
There is no index condition on feed_id (which would not be possible in this case), but a Filter.

Postgres chooses a completely different strategy: it reads rows from this index bottom-up (Index Scan Backward) until it finds a row matching one of your given values for feed_id. Since you only have very few projects and feeds (200 feed IDs and 19 project IDs), chances are it won't have to discard too many rows before the first match - which is the result. This actually gets faster with more values for feed_id, because the "latest" row is found earlier - unlike my first approach which is faster for fewer values.

A promising alternative strategy! Depending on data distribution and the feeds in your query it may be faster than my first solution - enable it with this index:

"approved_posts_foo_idx" btree (project_id, approved_time DESC, post_time DESC)

It may pay to selectively increase the statistics targets for the columns project_id and feed_id so the tipping point between both strategies can be estimated more accurately.

Postgresql - Query running a lot faster with enable_nestloop=false. Why is the planner not doing the right thing?

Since you have projects with only old rows (as per comment), you might improve this query with a hint as to the maximum approved_time (and post_time, but that's probably not adding much) - if you know the maximum approved_time per project (and / or per feed_id), or at least an upper bound.

SELECT ...
WHERE  ...
AND   approved_time <= $upper_bound

TechQA.

PostgreSQL not using index on a filtered multiple sort query

There are 2 answers

Why did your option A work?

Related Questions in SQL

Related Questions in POSTGRESQL

Related Questions in SORTING

Related Questions in INDEXING

Related Questions in POSTGRESQL-PERFORMANCE

Popular Questions

Popular Tags

Trending Questions