I am using PostgreSQL 9.3 on a Windows machine. Let's say I have a clients table with 4 million records and an orders table with 25 million records. However, I am interested in my New York clients only. There are only 5,000 New York clients who placed 15,000 orders, i.e. a drastically smaller subset.
What is the best way to retrieve the client ids and the total number of orders ever placed by New York clients?
Is a correlated subquery like
SELECT c.clientid
     , (SELECT count(orders.clientid) FROM orders WHERE orders.clientid = c.clientid) AS NumOrders
FROM clients c
WHERE c.city = 'New York'
faster than a join like
SELECT c.clientid
     , coalesce(o.NumOrders, 0) AS NumOrders
FROM clients c
LEFT OUTER JOIN
     (SELECT clientid, count(*) AS NumOrders FROM orders GROUP BY clientid) o
     ON c.clientid = o.clientid
WHERE c.city = 'New York'
because the latter spends most of the time counting records which are then discarded since they don't relate to New York clients? Or is there a better way?
Thank you!
PS Yes, I know, I should look at the execution plan, but I am writing this from home and I don't have a database with millions of records to test this on.
As you alluded to, the only way to truly know is to compare the execution plans. In fact, the best way would be to use EXPLAIN ANALYZE, so that it actually executes the query and includes the actual row counts and timings alongside the planner's estimates, letting you see how the planner's guesses compare with reality.

However, in general, what I would do in a situation like this would probably be to create a temp table for the client subset and then JOIN that to the orders table. You could optionally use WITH instead to do everything in one query.

So, something like:
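A sketch of that approach, assuming the table and column names from the question (clients.clientid, clients.city, orders.clientid):

```sql
-- Materialize the small New York subset first (~5K rows)
CREATE TEMP TABLE tmp_clients AS
SELECT clientid
FROM clients
WHERE city = 'New York';

-- Then aggregate orders against that small table only.
-- count(o.clientid) ignores NULLs, so clients with no orders get 0.
SELECT t.clientid
     , count(o.clientid) AS NumOrders
FROM tmp_clients t
LEFT OUTER JOIN orders o
     ON o.clientid = t.clientid
GROUP BY t.clientid;
```

You can prefix the final SELECT with EXPLAIN ANALYZE to confirm the join only ever touches the ~5K-row temp table on the outer side.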
This way, tmp_clients contains only the New York clients -- ~5K rows -- and it's that table that will be joined to the orders table.

You could also, to optimize further, create an index on the temp table (on clientid) and then ANALYZE it before doing the JOIN, to ensure that the JOIN is done on the index. You'd want to check the query plans in each case to see the relative difference (or just keep this in mind if the JOIN isn't quite as fast as you would like).

Response to comment from @poshest:
That sounds like the temp tables are stacking up, which would increase the memory footprint and, for a long-running connection, functionally appear to be a memory leak.
In that case, it wouldn't be a true leak, though, as temp tables are scoped to a connection. They disappear automatically, but not until the connection ends. However, you can make them disappear right away when you're done with them: simply DROP the table as you would any other once you're done with it, and I suspect you'll be able to call the function a bunch of times -- on the same connection -- without the same sort of monotonic memory-footprint increase.
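For the tmp_clients table suggested above, that cleanup is just:

```sql
-- Drop the temp table explicitly instead of waiting for the
-- connection to close; IF EXISTS makes repeated calls safe.
DROP TABLE IF EXISTS tmp_clients;
```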