Efficient way to filter the "one" side by values in the "many" side in a one-to-many relationship?


I have a table users in a Postgres database that has a one-to-many relationship to another table users_attributes, a simple key-value type table with a foreign key column to the users table.

create table users(
  id uuid primary key,
  name varchar
);

create table users_attributes(
  attribute_id integer primary key,
  user_id uuid references users(id),
  attribute_name varchar,
  attribute_value varchar
);

I need to filter users based on attribute_name and attribute_value in the users_attributes table. I have tried this query, which works but takes a long time to execute:

select * from users u
left join users_attributes ua1 on u.id = ua1.user_id and ua1.attribute_name = 'dog_name'
left join users_attributes ua2 on u.id = ua2.user_id and ua2.attribute_name = 'cat_name'
where ua1.attribute_value = 'Spot' and ua2.attribute_value = 'Mittens';

The number of joins goes up with each attribute I need to filter the users by. This slows the query down (4-10 seconds depending on the number of joins) since there are approximately a hundred thousand users. An explain plan on the query confirms this.

How do I query the users in a way that returns faster?


There are 2 answers

Answer by Erwin Brandstetter

The mix of LEFT JOIN and WHERE conditions makes no sense, logically: filtering on ua1.attribute_value and ua2.attribute_value in the WHERE clause discards the NULL-padded rows, so each LEFT JOIN acts as a plain [INNER] JOIN anyway.

Basic rewrite:

SELECT *
FROM   users u
JOIN   users_attributes ua1 ON u.id = ua1.user_id
JOIN   users_attributes ua2 ON u.id = ua2.user_id
WHERE  ua1.attribute_name = 'dog_name'
AND    ua1.attribute_value = 'Spot'
AND    ua2.attribute_name = 'cat_name'
AND    ua2.attribute_value = 'Mittens';

Basically, it's a case of relational division.

There are many ways to do this, a whole arsenal of query styles; the best one depends on your cardinalities, your typical filters, and what you are optimizing for.
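
One alternative from that arsenal is relational division with GROUP BY / HAVING, which needs only a single scan of users_attributes instead of one self-join per attribute. A sketch against the schema in the question, not a drop-in replacement:

SELECT u.*
FROM   users u
JOIN   users_attributes ua ON ua.user_id = u.id
WHERE  (ua.attribute_name, ua.attribute_value)
       IN (('dog_name', 'Spot'), ('cat_name', 'Mittens'))
GROUP  BY u.id  -- grouping by the PK allows SELECT u.*
HAVING count(DISTINCT ua.attribute_name) = 2;  -- all listed pairs must match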

The query I gave is among the fastest options. You need matching indexes, of course. And everything would be more efficient with proper normalization, where the attribute name moves into a separate table attribute, and attribute_id is an integer FK pointing there. An index on user_attribute(attribute_id, user_id) (two compact, fixed-width columns) would be ideal.

The query would resolve attribute names to integer IDs (explicitly, or implicitly in the query), and proceed with those.
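
A minimal sketch of that normalization, keeping the uuid user ids from the question (table and column names are illustrative, not prescribed by the answer):

CREATE TABLE attribute (
  attribute_id   serial PRIMARY KEY,
  attribute_name varchar UNIQUE NOT NULL
);

CREATE TABLE user_attribute (
  user_id         uuid    REFERENCES users(id),
  attribute_id    integer REFERENCES attribute(attribute_id),
  attribute_value varchar,
  PRIMARY KEY (user_id, attribute_id)
);

CREATE INDEX ON user_attribute (attribute_id, user_id);

-- resolve the attribute name to its id, then filter on the compact index:
SELECT u.*
FROM   users u
JOIN   user_attribute ua ON ua.user_id = u.id
WHERE  ua.attribute_id = (SELECT attribute_id FROM attribute
                          WHERE  attribute_name = 'dog_name')
AND    ua.attribute_value = 'Spot';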

Answer by bobflux

This type of query is the typical "craigslist query" that searches based on attributes (manufacturer, model, etc.); it can also apply to a dating site, for example.

Let's build some test data.

CREATE UNLOGGED TABLE users( user_id INTEGER NOT NULL );
INSERT INTO users SELECT generate_series( 1, 1000000 );
ALTER TABLE users ADD PRIMARY KEY( user_id );

CREATE UNLOGGED TABLE users_attrs( 
 user_id INTEGER NOT NULL, 
 attr_id INTEGER NOT NULL  );
-- give each user attribute aid with probability 0.5/aid: attribute 1 is set
-- for ~50% of users, attribute 20 for only ~2.5%
INSERT INTO users_attrs SELECT user_id, aid FROM (
    SELECT user_id, aid, 0.5/aid > random() AS x
    FROM generate_series(1,20) aid CROSS JOIN users ) foo
    WHERE x;
ALTER TABLE users_attrs ADD PRIMARY KEY (user_id,attr_id);
CREATE INDEX users_attrs_au ON users_attrs( attr_id, user_id );
SELECT attr_id,count(*) FROM users_attrs GROUP BY 1 ORDER BY 2;

 attr_id | count
---------+--------
      20 |  25104
      19 |  26570
      18 |  27638
      17 |  29574
      16 |  30982
      15 |  33490
      14 |  35574
      13 |  38473
      12 |  41816
      11 |  45373
      10 |  49641
       9 |  55793
       8 |  62471
       7 |  71386
       6 |  83123
       5 |  99592
       4 | 124920
       3 | 166107
       2 | 250662
       1 | 500446

I did not put the attribute name in users_attrs, since it belongs in a separate table.

For simplicity, I'm not using attribute values. Whether we use an index on (attr_id, user_id) or (attr_id, attribute_value, user_id), the result is the same for the purpose of performance measurement. When searching, what matters is the probability distribution, in other words the selectivity of the search criteria.

For example suppose you're looking for "women 25-30 years old near you" on a dating website. Searching based on "sex" first would be a terrible strategy because it has a selectivity of 50%, so the database would have to read half the table, then reject most of it due to the other criteria. Using the most selective criteria first gives much better performance. Hence my simulated probability distribution.

So we have a million users with 20 possible attributes; some are very frequent, like attribute 1, which is set for 50% of users, while others are rare, like attribute 20, which is set for only ~2.5% of users.

VACUUM ANALYZE;

Let's do a naive search for the very common attributes 1 and 2:

EXPLAIN ANALYZE SELECT *
    FROM users_attrs u1
    JOIN users_attrs u2 USING (user_id)
    WHERE u1.attr_id=1 AND u2.attr_id=2;

 Merge Join  (rows=125248)
   Merge Cond: (u1.user_id = u2.user_id)
   ->  Index Only Scan using users_attrs_au on users_attrs u1
         Index Cond: (attr_id = 1)
   ->  Index Only Scan using users_attrs_au on users_attrs u2
         Index Cond: (attr_id = 2)
 Execution Time: 93.706 ms

Observations:

  • Index-only scan saves the day by allowing an efficient merge join. I don't see any index in your question, so you should try adding one on (attribute_name, attribute_value, user_id), in that order: it lets the database search for an attribute name with a specific value (the first two columns) and then read the user_ids straight from the index, without even looking at the table. See the sketch after this list.

  • It's rather slow (~100ms) and doesn't scale well.

  • The search returns 125k rows, which means it is useless. The user is going to look at the number of pages displayed, sigh, and enter more targeted search criteria. This means the resources were wasted (especially any sorting, which I didn't add to the query).
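
A sketch of that index on the original table from the question (the index name is invented):

CREATE INDEX users_attributes_name_value_user_idx
    ON users_attributes (attribute_name, attribute_value, user_id);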

Now let's search on one attribute having multiple values, which I will simulate by searching for ids (1 or 2) and 3.

EXPLAIN ANALYZE SELECT *
    FROM users_attrs u1
    JOIN users_attrs u2 USING (user_id)
    WHERE u1.attr_id BETWEEN 1 AND 2 AND u2.attr_id=3;

 Hash Join  (rows=124719)
   Hash Cond: (u1.user_id = u2.user_id)
   ->  Index Only Scan using users_attrs_au on users_attrs u1
         Index Cond: ((attr_id >= 1) AND (attr_id <= 2))
   ->  Hash
         ->  Index Only Scan using users_attrs_au on users_attrs u2  
               Index Cond: (attr_id = 3)
 Execution Time: 151.845 ms

Change of plan: in the previous case the index yielded user_ids in order, allowing an efficient merge join. With the range condition it does not (rows come out ordered by (attr_id, user_id), not by user_id alone), so Postgres uses a hash join instead. The same remarks as above apply.

Now let's search for two rare attributes.

EXPLAIN ANALYZE SELECT *
    FROM users_attrs u1
    JOIN users_attrs u2 USING (user_id)
    WHERE u1.attr_id=19 AND u2.attr_id=20;

 Merge Join  (cost=0.85..1565.88 rows=892 width=12) (actual time=0.223..9.917 rows=659 loops=1)
   Merge Cond: (u1.user_id = u2.user_id)
   ->  Index Only Scan using users_attrs_au on users_attrs u1  (cost=0.43..710.83 rows=24823 width=8) (actual time=0.096..3.495 rows=26570 loops=1)
         Index Cond: (attr_id = 19)
   ->  Index Only Scan using users_attrs_au on users_attrs u2  (cost=0.43..721.11 rows=25182 width=8) (actual time=0.030..3.284 rows=25103 loops=1)
         Index Cond: (attr_id = 20)
 Execution Time: 9.988 ms

That's pretty nice. It's fast, the row count estimates are good and the final result is usable: with a bit of sorting the user should find something in there.

Note that the search queries that destroy your server are always the useless ones: the ones that try to return a gorillion rows. Even if they ran fast, no one would ever read the results.

Now let's search for one common and two rare attributes.

EXPLAIN ANALYZE SELECT *
    FROM users_attrs u1
    JOIN users_attrs u2 USING (user_id)
    JOIN users_attrs u3 USING (user_id)
    WHERE u1.attr_id=1 AND u2.attr_id=19 AND u3.attr_id=20;

 Nested Loop  (cost=1.28..2674.39 rows=636 width=16) (actual time=0.173..9.837 rows=335 loops=1)
   ->  Merge Join  (cost=0.85..1565.88 rows=892 width=16) (actual time=0.117..8.189 rows=659 loops=1)
         Merge Cond: (u2.user_id = u3.user_id)
         ->  Index Only Scan using users_attrs_au on users_attrs u2  (cost=0.43..710.83 rows=24823 width=8) (actual time=0.041..2.838 rows=26570 loops=1)
               Index Cond: (attr_id = 19)
               Heap Fetches: 0
         ->  Index Only Scan using users_attrs_au on users_attrs u3  (cost=0.43..721.11 rows=25182 width=8) (actual time=0.012..2.660 rows=25103 loops=1)
               Index Cond: (attr_id = 20)
               Heap Fetches: 0
   ->  Index Only Scan using users_attrs_au on users_attrs u1  (cost=0.43..1.24 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=659)
         Index Cond: ((attr_id = 1) AND (user_id = u2.user_id))
         Heap Fetches: 0
 Planning Time: 1.449 ms
 Execution Time: 9.907 ms

This is also good. I purposefully put the joins in the wrong order: PG noticed the two rare attributes and reordered the joins to search them first (it's the merge join in the middle, which returns 659 rows). Then it checks whether the resulting rows also have the common attribute, keeping 335 rows. So it avoids scanning through the 500k rows with common attribute #1, which is exactly what we want.

In your case, with attribute values, it's a bit more complicated, because the default statistics gathered by Postgres and used by the query planner are per-column only. So you may want to enable multivariate statistics on (attribute_name, attribute_value) to get better estimates.
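
A sketch against the table from the question (Postgres 10 or later; the statistics name is invented):

CREATE STATISTICS users_attributes_name_value_stats (dependencies)
    ON attribute_name, attribute_value
    FROM users_attributes;
ANALYZE users_attributes;  -- the extended statistics are built on the next ANALYZE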

But the most important thing is the correct index as mentioned above.

If your attribute values are fixed per attribute (i.e., multiple-choice questions), then you can assign an id number to every attribute-value pair and my example applies directly.
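
For example, with a hypothetical dictionary table along these lines:

CREATE TABLE attr_value (
  attr_value_id   serial PRIMARY KEY,
  attribute_name  varchar NOT NULL,
  attribute_value varchar NOT NULL,
  UNIQUE (attribute_name, attribute_value)
);

-- users_attrs then stores (user_id, attr_value_id); a filter such as
-- dog_name = 'Spot' becomes one id lookup plus the integer joins above.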

Your problem also maps quite exactly to... full text search. You could use a fulltext search engine; they're optimized exactly for that. Say, if the attribute is dog_name='rex', you could put all of the user's attributes in a text field, in the form "dog_name__rex" for example, and shove that into Lucene or Xapian for millisecond search times on enormous datasets.

Postgres does have a fulltext module, but it's not that fast. However, if you can map your problem to "assign an id number to every attribute-value pair", then you can use a similar backend, the intarray module:

CREATE EXTENSION IF NOT EXISTS intarray;  -- provides gist__int_ops
CREATE UNLOGGED TABLE users_attrs_a( user_id INTEGER NOT NULL, attr_ids INTEGER[] );
INSERT INTO users_attrs_a SELECT user_id, array_agg(attr_id) FROM users_attrs GROUP BY user_id;
CREATE INDEX users_attrs_a_rdtree_idx ON users_attrs_a USING GIST (attr_ids gist__int_ops);
VACUUM ANALYZE users_attrs_a;
EXPLAIN ANALYZE SELECT * FROM users_attrs_a WHERE attr_ids @> '{1,19,20}';
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users_attrs_a  (cost=27.85..1438.56 rows=444 width=33) (actual time=7.224..7.770 rows=335 loops=1)
   Recheck Cond: (attr_ids @> '{1,19,20}'::integer[])
   Heap Blocks: exact=328
   ->  Bitmap Index Scan on users_attrs_a_rdtree_idx  (cost=0.00..27.74 rows=444 width=0) (actual time=7.197..7.197 rows=335 loops=1)
         Index Cond: (attr_ids @> '{1,19,20}'::integer[])
 Planning Time: 0.326 ms
 Execution Time: 7.846 ms
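
The earlier "(1 or 2) and 3" search also maps directly onto the array operators; a small usage sketch (&& is "overlaps", @> is "contains", and both are supported by the GiST index):

SELECT * FROM users_attrs_a
WHERE  attr_ids && '{1,2}'   -- has attribute 1 or 2
AND    attr_ids @> '{3}';    -- and has attribute 3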

It's a little bit faster and scales better. At your table size, though, it's probably not worth the hassle.