Query planner differences in similar queries

48 views Asked by At

I have two tables, one for profiles and one for the profile's employment status. The two tables have a one-to-one relationship. One profile might might not have an employment status. The table schemas are as below (irrelevant columns removed for clarity):

create type employment_status as enum ('claimed', 'approved', 'denied');

create table if not exists profiles
(
    id bigserial not null
        constraint profiles_pkey
            primary key
);

create table if not exists employments
(
    id bigserial not null
        constraint employments_pkey
            primary key,
    status employment_status not null,
    profile_id bigint not null
        constraint fk_rails_d95865cd58
            references profiles
                on delete cascade
);

create unique index if not exists index_employments_on_profile_id
    on employments (profile_id);

With these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as a profile not having an employment record or having an employment with a status other than "approved".

My first tentative was the following query:

SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

The assumption here was that all profiles will be listed with their respective employments, then I could filter them with the where condition. Any profile without an employment record would have an employment status of null and therefore be filtered by the condition. However, this query did not return profiles without an employment.

After some research I found this post, explaining why it doesn't work and transformed my query:

SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

Which actually did work. But, my ORM produced a slightly different query, which didn't work.

SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

The only difference being the select clause. I tried to understand why this slight difference produced such a difference an ran an explain analyze all three queries:

EXPLAIN ANALYZE SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

Hash Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
  Hash Cond: (e.profile_id = profiles.id)
  ->  Seq Scan on employments e  (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms
EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

Hash Right Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  ->  Seq Scan on employments  (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms
EXPLAIN ANALYZE SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms

The first and second query plans are almost the same expect for the hash join for one and right hash join for the other, while the last query doesn't even do join or the where condition.

I came up with a forth query that did work:

EXPLAIN ANALYZE SELECT profiles.* FROM profiles 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE (employments.id IS NULL OR employments.status != 'approved')

Hash Right Join  (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  Filter: ((employments.id IS NULL) OR (employments.status <> ''approved''::employment_status))
  Rows Removed by Filter: 1
  ->  Seq Scan on employments  (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms

My questions about the subject is:

  1. Why are the query plans for second and third queries different even though they have the same structure?
  2. Why are the query plans the first and fourth queries different even though they the same structure?
  3. Why is Postgres totally ignoring my join and where condition for the third query?

EDIT:

With the following sample data, the expected query should return 2 and 3.

insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);

insert into employments (profile_id, status) values (1, 'approved');
insert into employments (profile_id, status) values (2, 'denied');
1

There are 1 answers

0
Laurenz Albe On

There must be a unique or primary key constraint on employments.profile_id (or it is a view with an appropriate DISTINCT clause) so that the optimizer knows that there can be at most one row in employments that is related to a given row in profiles.

If that is the case and you don't use employments's columns in the SELECT list, the optimizer deduces that the join is redundant and need not be calculated, which makes for a simpler and faster execution plan.

See the comment for join_is_removable in src/backend/optimizer/plan/analyzejoins.c:

/*
 * join_is_removable
 *    Check whether we need not perform this special join at all, because
 *    it will just duplicate its left input.
 *
 * This is true for a left join for which the join condition cannot match
 * more than one inner-side row.  (There are other possibly interesting
 * cases, but we don't have the infrastructure to prove them.)  We also
 * have to check that the inner side doesn't generate any variables needed
 * above the join.
 */