Does PostgreSQL self join ignore indexes?

I have the following table in PostgreSQL 8.4.12:

           Table "public.ratings"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 userid | character varying(128) |
 item   | character varying(128) |
 score  | integer                |
Indexes:
    "ratings_item" btree (item)
    "ratings_ui" btree (userid, item)
    "ratings_userid" btree (userid)

I would like to perform a self join to find items rated by the same users who rated a specific item. For simplicity's sake, I'll use a query that just gets the number of ratings for each candidate similar item, like so:

select r2.item,sum(1)
from ratings r1
left join ratings r2 using (userid)
where r1.item='an3.php'
group by r2.item

The query works but for my table with 36 million records, it takes forever. When I explain the statement, I get the following:

GroupAggregate  (cost=8102958.42..8247621.18 rows=16978 width=17)
  ->  Sort  (cost=8102958.42..8151108.60 rows=19260072 width=17)
        Sort Key: r2.item
        ->  Hash Left Join  (cost=1458652.29..4192647.43 rows=19260072 width=17)
              Hash Cond: ((r1.userid)::text = (r2.userid)::text)
              ->  Bitmap Heap Scan on ratings r1  (cost=868.20..77197.24 rows=24509 width=22)
                    Recheck Cond: ((item)::text = 'an3.php'::text)
                    ->  Bitmap Index Scan on ratings_item  (cost=0.00..862.07 rows=24509 width=0)
                          Index Cond: ((item)::text = 'an3.php'::text)
              ->  Hash  (cost=711028.93..711028.93 rows=36763293 width=39)
                    ->  Seq Scan on ratings r2  (cost=0.00..711028.93 rows=36763293 width=39)

From experience, I assume the "Seq Scan on ratings r2" is the culprit.
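
To test that assumption, one option is to discourage sequential scans for the session and compare plans. This is diagnostic only; enable_seqscan is a standard planner setting, but it should not stay off in production:

SET enable_seqscan = off;  -- discourages (does not forbid) sequential scans

EXPLAIN ANALYZE
SELECT r2.item, sum(1)
FROM   ratings r1
LEFT   JOIN ratings r2 USING (userid)
WHERE  r1.item = 'an3.php'
GROUP  BY r2.item;

RESET enable_seqscan;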

On another note, if I search for an item that doesn't exist:

select r2.item,sum(1) from ratings r1 left join ratings r2 using (userid)
where r1.item='targetitem' group by r2.item;

It appears to work fine (i.e., no results are returned and the query comes back immediately):

GroupAggregate  (cost=2235887.19..2248234.70 rows=16978 width=17)
  ->  Sort  (cost=2235887.19..2239932.29 rows=1618038 width=17)
        Sort Key: r2.item
        ->  Nested Loop Left Join  (cost=0.00..1969469.94 rows=1618038 width=17)
              ->  Index Scan using ratings_item on ratings r1  (cost=0.00..8317.74 rows=2059 width=22)
                    Index Cond: ((item)::text = 'targetitem'::text)
              ->  Index Scan using ratings_userid on ratings r2  (cost=0.00..947.24 rows=419 width=39)
                    Index Cond: ((r1.userid)::text = (r2.userid)::text)

The same table and query work fine in MySQL, but I'm not in a position to migrate my recommender system to another database.

Did I do something wrong, or is this a Postgres issue? Is there a workaround?

1 Answer

Erwin Brandstetter

To answer the (rhetorical) question in the title: No.

I see quite a few problems here, starting in the first line.

Postgres 8.4 reached EOL last year. Nobody should be using it any more; it's too old. Upgrade to a current version if at all possible.

Barring that, you should at least be on the latest minor version. 8.4.12 was released 2012-06-04 and is missing two years of bug and security fixes. 8.4.22 is the last release of that dead version branch.
Read the versioning policy of the project.
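
If in doubt about which version a server is actually running, either of these standard commands reports it:

SELECT version();
SHOW server_version;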

Next, varchar(128) is very inefficient as a PK / FK, especially for a table with millions of rows: needlessly big and expensive to process. Use integer or bigint instead, or UUID if you really need a bigger number space (I doubt it).
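
A normalized layout with integer surrogate keys could look like the following sketch (table and column names are illustrative, not taken from the question):

CREATE TABLE users (
    user_id serial PRIMARY KEY
  , userid  varchar(128) UNIQUE NOT NULL  -- old natural key, kept for lookups
);

CREATE TABLE items (
    item_id serial PRIMARY KEY
  , item    varchar(128) UNIQUE NOT NULL
);

CREATE TABLE ratings_n (
    user_id integer NOT NULL REFERENCES users
  , item_id integer NOT NULL REFERENCES items
  , score   integer
  , PRIMARY KEY (user_id, item_id)  -- doubles as the (user, item) index
);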

Next, I don't see a UNIQUE or PRIMARY KEY constraint on (userid, item), which would make an additional index on the same columns redundant. Either your table definition is lacking, your query is wrong, or your question is incomplete.
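
If (userid, item) is in fact unique, adding the constraint is simple (this assumes no duplicate rows exist yet, or the command fails; the constraint name is my choice):

ALTER TABLE ratings ADD CONSTRAINT ratings_userid_item_uni UNIQUE (userid, item);
DROP INDEX ratings_ui;  -- redundant now: the constraint creates its own unique index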

Try this rewritten query:

SELECT r2.item, count(*) AS ct
FROM  (
   SELECT userid
   FROM   ratings
   WHERE  item = 'an3.php'
   GROUP  BY 1  -- should not be necessary, but constraint is missing
   ) r1
JOIN   ratings r2 USING (userid)
GROUP  BY 1;

In modern Postgres you would want two indexes for best performance: on (item, userid) and on (userid, item).
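
Your existing ratings_ui already covers (userid, item), so only the reverse column order is missing (the index name is my choice):

CREATE INDEX ratings_item_userid_idx ON ratings (item, userid);
-- once this exists, the single-column indexes ratings_item and ratings_userid
-- are largely redundant and are candidates for removal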

In Postgres 9.2+ you might even get index-only scans out of this. I am not sure how to get the best out of your outdated version. Either way, varchar(128) is an expensive data type for indexes, too.
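
A sketch for 9.2 or later (not applicable to 8.4): index-only scans also require an up-to-date visibility map, so vacuum first, then check the plan for "Index Only Scan" nodes:

VACUUM ANALYZE ratings;  -- refreshes the visibility map (and planner statistics)

EXPLAIN (ANALYZE, BUFFERS)
SELECT r2.item, count(*) AS ct
FROM  (
   SELECT userid
   FROM   ratings
   WHERE  item = 'an3.php'
   GROUP  BY 1
   ) r1
JOIN   ratings r2 USING (userid)
GROUP  BY 1;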