In Postgres, how to match multiple "tags" for best performance?

16.1k views Asked by At

Table: articles

+--------+------+------------+
| id     | title|    created |
+--------+------+------------+
|    201 | AAA  | 1482561011 |
|    202 | BBB  | 1482561099 |
|    203 | CCC  | 1482562188 |
+--------+------+------------+

Table: taggings

+-----------+------+
| articleid | tagid|
+-----------+------+
|    201    | 11   |
|    201    | 12   |
|    202    | 11   |
|    202    | 13   |
|    202    | 14   |
+-----------+------+

Now if given 3 tag ids, what is the best index design and query to select latest 10 articles that each article match the 3 tag ids at the same time?
I know there can be several ways to do it, but I'm concerning the performance, considering there maybe tens of thousands of articles in each tag

4

There are 4 answers

0
Philip Tzou On

You need have an index on articles.created for sorting, and another unique index on taggings(articleid, tagid) for querying:

CREATE INDEX ON articles(created);
CREATE UNIQUE INDEX ON taggings(articleid, tagid);

Then just make a select query with three taggings table aliases:

SELECT a.* FROM articles a, taggings t1, taggings t2, taggings t3
    WHERE a.id=t1.articleid AND a.id=t2.articleid AND a.id=t3.articleid
    AND t1.tagid=111 AND t2.tagid=222 AND t3.tagid=333
    ORDER BY created DESC LIMIT 10;
0
AudioBubble On
select distinct on (a.id) a.*
from articles a 
  join taggings t on t.articleid = a.id
group by a.id 
having array_agg(t.tagid order by t.tagid) = array[11,13,14]
order by a.id, a.created
limit 10;

An index on taggings (articleid, tagid) will help for this.

Note that the above looks for articles with exactly those three tags. If you want to find those with at least those three tags (and possibly more) you can change the having clause to use the "contains" operator:

select distinct on (a.id) a.*
from articles a 
  join taggings t on t.articleid = a.id
where t.tagid in (11,13,14)
group by a.id 
having array_agg(t.tagid) @> array[11,13,14]
order by a.id, a.created
limit 10;

In that case the order by for array_agg() is not necessary

0
kolypto On

In my case I had to apply a complex condition to every "tag", and @> was of no use. So I found another approach: a grid sensor array:

ARRAY[
  bool_or(true if tag1 is present), 
  bool_or(true if tag2 is present),
  ...
] = ARRAY[true, true, ...]

Example:

SELECT a.*
FROM articles a JOIN tags t ON(t.articleid = a.id)
GROUP BY a.id
HAVING 
  ARRAY[
    bool_or(t.tagid == 11),
    bool_or(t.tagid == 13),
    bool_or(t.tagid == 14)
  ] == ARRAY[true, true, true]

it has poor performance, but great flexibility for many2many relationships.

2
Fredrik On

As a_horse_with_no_name mentioned, this blog post has some really interesting performance benchmarks for finding rows matching more than one tag:

http://www.databasesoup.com/2015/01/tag-all-things.html

Storing tags in an array column in the main table and creating a GIN-index allows rows to be selected like this, without any joins:

select id
from articles
where tags @> array[11,13,14]
order by created desc
limit 10;

The column and index can be created like this:

alter table articles add column tags text[] not null default '{}';
create index tags_index on articles using gin (tags);

According to the blog, finding rows matching two tags was between 8 and 895 times faster when using the array-column than when joining in a tags table.