I have two tables, one for profiles and one for the profile's employment status. The two tables have a one-to-one relationship. One profile might might not have an employment status. The table schemas are as below (irrelevant columns removed for clarity):
create type employment_status as enum ('claimed', 'approved', 'denied');
create table if not exists profiles
(
id bigserial not null
constraint profiles_pkey
primary key
);
create table if not exists employments
(
id bigserial not null
constraint employments_pkey
primary key,
status employment_status not null,
profile_id bigint not null
constraint fk_rails_d95865cd58
references profiles
on delete cascade
);
create unique index if not exists index_employments_on_profile_id
on employments (profile_id);
With these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as a profile not having an employment record or having an employment with a status other than "approved".
My first tentative was the following query:
SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
The assumption here was that all profiles will be listed with their respective employments, then I could filter them with the where condition. Any profile without an employment record would have an employment status of null
and therefore be filtered by the condition. However, this query did not return profiles without an employment.
After some research I found this post, explaining why it doesn't work and transformed my query:
SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
Which actually did work. But, my ORM produced a slightly different query, which didn't work.
SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
The only difference being the select clause. I tried to understand why this slight difference produced such a difference an ran an explain analyze all three queries:
EXPLAIN ANALYZE SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
Hash Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
Hash Cond: (e.profile_id = profiles.id)
-> Seq Scan on employments e (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
Filter: (status <> ''approved''::employment_status)
Rows Removed by Filter: 1
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms
EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
Hash Right Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
Hash Cond: (employments.profile_id = profiles.id)
-> Seq Scan on employments (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
Filter: (status <> ''approved''::employment_status)
Rows Removed by Filter: 1
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms
EXPLAIN ANALYZE SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms
The first and second query plans are almost the same expect for the hash join for one and right hash join for the other, while the last query doesn't even do join or the where condition.
I came up with a forth query that did work:
EXPLAIN ANALYZE SELECT profiles.* FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE (employments.id IS NULL OR employments.status != 'approved')
Hash Right Join (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
Hash Cond: (employments.profile_id = profiles.id)
Filter: ((employments.id IS NULL) OR (employments.status <> ''approved''::employment_status))
Rows Removed by Filter: 1
-> Seq Scan on employments (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms
My questions about the subject is:
- Why are the query plans for second and third queries different even though they have the same structure?
- Why are the query plans the first and fourth queries different even though they the same structure?
- Why is Postgres totally ignoring my join and where condition for the third query?
EDIT:
With the following sample data, the expected query should return 2 and 3.
insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);
insert into employments (profile_id, status) values (1, 'approved');
insert into employments (profile_id, status) values (2, 'denied');
There must be a unique or primary key constraint on
employments.profile_id
(or it is a view with an appropriateDISTINCT
clause) so that the optimizer knows that there can be at most one row inemployments
that is related to a given row inprofiles
.If that is the case and you don't use
employments
's columns in theSELECT
list, the optimizer deduces that the join is redundant and need not be calculated, which makes for a simpler and faster execution plan.See the comment for
join_is_removable
insrc/backend/optimizer/plan/analyzejoins.c
: