I'm experimenting with PostgreSQL (v9.3). I have a quite large database, and often I need to execute queries with 8-10 joined tables (as source of large data grids). I'm using Devexpress XPO as the ORM above PostgreSQL, so unfortunately I don't have any control over how joins are generated.
The following example is a fairly simplified one, the real scenario is more complex, but as far as my examination the main problem can be seen on this too.
Consider the following variants of the (semantically) same query:
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
The orders
table contains about 1 million rows, and the customers
about 30 thousand. The order details contains the same amount as orders
due to a one-to-one relation.
UPDATE: It seems like the example is too simplified to reproduce the issue, because I checked again and in this case the two execution plain is identical. However in my real query where there are much more joins, the problem occures: if I put customers as the first join, the execution is 100x faster. I'll add my real query, but due to the hungarian language and the fact that it's been generated by XPO and Npgsql makes it less readable.
The first query is significantly slower (about 100x) than the second, and when I output the plans with EXPLAIN ANALYZE I can see that the order of the joins reflects to their position in the query string. So firstly the two "giant" tables are joined together, and then after the filtered customer table is joined (where the filter selects only one row).
The second query is faster because the join starts with that one customer row, and after that it joins the 20-30 order details rows.
Unfortunately in my case XPO generates the first version so I'm suffering with performance.
Why PostgreSQL query planner not noticing that the join on customers has a condition in the WHERE clauuse? IMO the correct optimization would be to take those joins first which has any kind of filter, and then take those joins which participate only in selection.
Any kind of help or advice is appreciated.
Join orders only matters, if your query's joins not collapsed. This is done internally by the query planner, but you can manipulate the process with the
join_collapse_limit
runtime option.Note however, the query planner will not find every time the best join order by default:
For the best performance, I recommend to use some kind of native querying, if available. Raising the
join_collapse_limit
can be a good-enough solution though, if you ensure, this hasn't caused other problems.Also worth to mention, that raising
join_collapse_limit
will most likely increase the planning time.