PostgreSQL: best way to join small subsets of large tables


I am using PostgreSQL 9.3 on a Windows machine. Let's say I have a clients table with 4 million records and an orders table with 25 million records. However, I am interested in my New York clients only. There are only 5,000 New York clients who placed 15,000 orders, i.e. a drastically smaller subset.

What is the best way to retrieve the client ids and the total number of orders ever placed by New York clients?

Is a correlated subquery like

SELECT c.clientid
     , (SELECT count(orders.clientid) FROM orders WHERE orders.clientid = c.clientid) AS NumOrders
FROM clients c
WHERE c.city = 'New York'

faster than a join like

SELECT c.clientid
     , coalesce(o.NumOrders, 0) AS NumOrders
FROM clients c
LEFT OUTER JOIN
    (SELECT clientid, count(*) AS NumOrders FROM orders GROUP BY clientid) o
    ON c.clientid = o.clientid
WHERE c.city = 'New York'

because the latter spends most of the time counting records which are then discarded since they don't relate to New York clients? Or is there a better way?

Thank you!

PS Yes, I know, I should look at the execution plan, but I am writing this from home and I don't have a database with millions of records to test this on.


There are 2 answers

khampson

As you alluded to, the only way to truly know is to compare the execution plans. In fact, the best way would be to use EXPLAIN ANALYZE, which actually executes the query and reports the real row counts and timings alongside the planner's estimates, so you can get a sense of how the query planner's expectations stack up against reality.
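For example, prefixing the first query from the question (a minimal sketch):

EXPLAIN ANALYZE
SELECT c.clientid
     , (SELECT count(orders.clientid) FROM orders WHERE orders.clientid = c.clientid) AS NumOrders
FROM clients c
WHERE c.city = 'New York';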

However, in general, what I would do in a situation like this would probably be to create a temp table for the client subset and then JOIN that to the orders table. You could optionally use WITH instead to do everything in one query (see the sketch after the temp-table version below).

So, something like:

CREATE TEMP TABLE tmp_clients AS
SELECT c.clientid
FROM clients c
WHERE c.city = 'New York'   -- only the ~5K New York clients
ORDER BY c.clientid;

SELECT *
FROM orders AS o
JOIN tmp_clients AS c ON (o.clientid = c.clientid)   -- join against the small subset only
ORDER BY o.clientid;

This way, tmp_clients contains only the New York clients -- ~5K rows -- and it's that table that will be joined to the orders table.
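If you'd rather keep it to one statement, here's a sketch of the same idea with WITH (the CTE name is arbitrary). Note that in 9.3 a CTE acts as an optimization fence, so it is materialized much like the temp table:

WITH ny_clients AS (
    SELECT c.clientid
    FROM clients c
    WHERE c.city = 'New York'
)
SELECT *
FROM orders AS o
JOIN ny_clients AS c ON (o.clientid = c.clientid)
ORDER BY o.clientid;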

You could also, to optimize further, create an index on the temp table (on clientid) and then ANALYZE it before doing the JOIN, so the planner has accurate statistics about the temp table and can use the index for the JOIN. You'd want to check the query plans in each case to see the relative difference (or just keep this in mind if the JOIN isn't quite as fast as you would like).
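Something along these lines (a sketch; the index name is arbitrary):

CREATE INDEX tmp_clients_clientid_idx ON tmp_clients (clientid);
ANALYZE tmp_clients;   -- give the planner fresh statistics for the temp table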

Response to comment from @poshest:

That sounds like the temp tables are stacking up, which would increase the memory footprint and, for a long-running connection, functionally appear to be a memory leak.

In that case, it wouldn't be a true leak, though, as temp tables are scoped to a connection. They disappear automatically, but not until the connection ends. However, you can make them disappear right away when you're done with them: simply DROP the table as you would any other, and I suspect you'll be able to call the function a bunch of times -- on the same connection -- without the same monotonic increase in memory footprint.
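For example:

DROP TABLE IF EXISTS tmp_clients;   -- releases the temp table as soon as you're done with it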

Mike Sherrill 'Cat Recall'

In the first case, you're using a correlated subquery that will have to execute at least once per matching client id. It will perform poorly.

In the second case, you're using a left outer join on a subquery, which is a little odd. It will preserve values for clients who have no orders. This is a little bit like talking about customers who have never bought anything--are they really customers? I'm going to say "No", because this is my answer and I can make up anything I like. But I've seen cases where it does make sense to talk about clients you've never worked for. (Law firms sometimes have clients who have no billable hours.)

Anyway . . .

You're almost certainly better off with a simple, straightforward expression of what you're trying to count.

select o.clientid, count(*)
from orders o
inner join clients c on c.clientid = o.clientid
       and c.city = 'New York'
group by o.clientid;

You'll want an index on "city".
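For instance (this is the clients_city_idx index that shows up in the plans below):

CREATE INDEX clients_city_idx ON clients (city);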


I built, populated, and indexed some tables with random(ish) data. The simple, straightforward query was faster by an order of magnitude.

explain analyze
Select c.clientid
      ,coalesce(o.NumOrders,0) as NumOrders
From clients c
Left outer join
    ( select clientid, count(*) as NumOrders from orders group by clientid ) o
on c.clientid = o.clientid
WHERE c.city = 'New York';
"Merge Right Join  (cost=4813633.08..5022834.83 rows=5200 width=12) (actual time=105751.121..117901.326 rows=5000 loops=1)"
"  Merge Cond: (orders.clientid = c.clientid)"
"  ->  GroupAggregate  (cost=4799762.91..4996891.51 rows=962770 width=4) (actual time=105702.262..117735.305 rows=1000000 loops=1)"
"        ->  Sort  (cost=4799762.91..4862263.21 rows=25000120 width=4) (actual time=105688.877..113148.881 rows=25000000 loops=1)"
"              Sort Key: orders.clientid"
"              Sort Method: external merge  Disk: 342176kB"
"              ->  Seq Scan on orders  (cost=0.00..360621.20 rows=25000120 width=4) (actual time=14.866..35814.499 rows=25000000 loops=1)"
"  ->  Sort  (cost=13870.18..13883.18 rows=5200 width=4) (actual time=39.536..40.241 rows=5000 loops=1)"
"        Sort Key: c.clientid"
"        Sort Method: quicksort  Memory: 427kB"
"        ->  Bitmap Heap Scan on clients c  (cost=144.73..13549.22 rows=5200 width=4) (actual time=29.556..30.638 rows=5000 loops=1)"
"              Recheck Cond: ((city)::text = 'New York'::text)"
"              ->  Bitmap Index Scan on clients_city_idx  (cost=0.00..143.43 rows=5200 width=0) (actual time=29.538..29.538 rows=5000 loops=1)"
"                    Index Cond: ((city)::text = 'New York'::text)"
"Total runtime: 118027.256 ms"
explain analyze
select o.clientid, count(*)
from orders o
inner join clients c on c.clientid = o.clientid
       and c.city = 'New York'
group by o.clientid;
"GroupAggregate  (cost=595747.05..596315.80 rows=32500 width=4) (actual time=9167.518..9179.855 rows=1250 loops=1)"
"  ->  Sort  (cost=595747.05..595828.30 rows=32500 width=4) (actual time=9167.504..9174.135 rows=31336 loops=1)"
"        Sort Key: o.clientid"
"        Sort Method: external merge  Disk: 432kB"
"        ->  Hash Join  (cost=13614.22..593311.47 rows=32500 width=4) (actual time=3.055..9123.568 rows=31336 loops=1)"
"              Hash Cond: (o.clientid = c.clientid)"
"              ->  Seq Scan on orders o  (cost=0.00..360621.20 rows=25000120 width=4) (actual time=0.041..3833.387 rows=25000000 loops=1)"
"              ->  Hash  (cost=13549.22..13549.22 rows=5200 width=4) (actual time=2.915..2.915 rows=5000 loops=1)"
"                    Buckets: 1024  Batches: 1  Memory Usage: 176kB"
"                    ->  Bitmap Heap Scan on clients c  (cost=144.73..13549.22 rows=5200 width=4) (actual time=0.672..1.769 rows=5000 loops=1)"
"                          Recheck Cond: ((city)::text = 'New York'::text)"
"                          ->  Bitmap Index Scan on clients_city_idx  (cost=0.00..143.43 rows=5200 width=0) (actual time=0.658..0.658 rows=5000 loops=1)"
"                                Index Cond: ((city)::text = 'New York'::text)"
"Total runtime: 9180.291 ms"