What makes bad row estimates a pain point in SQL query performance? I’m interested to know the internal reasons why.
Often the planner will still pick the correct plan despite a bad row estimate, and the only visible difference between a good query and a bad one is the estimated row counts.
Why is there frequently such a massive performance difference?
Is it because Postgres uses row estimates to allocate memory?
PostgreSQL's optimizer is a cost-based optimizer (CBO): it executes a query with the lowest-cost execution plan it can find, and that cost is calculated from the table's statistics.
Wrong statistics can therefore lead it to choose a bad execution plan. Here is an example.
There are two tables: `T1` has 20,000,000 rows and `T2` has 1,000,000 rows. When we join the two tables, we get an execution plan that might use a `Merge Join`.
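A minimal sketch of that setup (the table and column names, and the `generate_series` filler data, are illustrative, not from the original post):

```sql
-- Illustrative setup: two tables joined on an integer id
CREATE TABLE t1 (id integer PRIMARY KEY, payload text);
CREATE TABLE t2 (id integer PRIMARY KEY, payload text);

INSERT INTO t1 SELECT g, 'x' FROM generate_series(1, 20000000) AS g;
INSERT INTO t2 SELECT g, 'x' FROM generate_series(1, 1000000) AS g;

ANALYZE t1;
ANALYZE t2;

-- With fresh statistics the planner may pick a Merge Join for this query
EXPLAIN
SELECT *
FROM t1
JOIN t2 ON t1.id = t2.id;
```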
But suppose I then update a lot of rows, as below, adding `100000000` to `id` for every row whose `id` is smaller than `1000000`. If we run the same query again, it will still use a `Merge Join`, even though there is now a better option than `Merge Join`.
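A sketch of that update, using the illustrative tables from above:

```sql
-- Shift the low ids in t1 so they no longer line up with t2's ids
UPDATE t1
SET id = id + 100000000
WHERE id < 1000000;
```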
If you haven't hit the autoanalyze threshold, the planner is still working from stale statistics. (Autoanalyze fires once the number of rows inserted, updated, or deleted since the last analyze exceeds `autovacuum_analyze_threshold` plus `autovacuum_analyze_scale_factor` times the table size; with the default scale factor of `0.1`, roughly 10% of the table has to change before PostgreSQL refreshes the statistics automatically.)
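You can check the relevant settings and how much a table has changed since its last analyze with the standard statistics views (a sketch):

```sql
-- Current autoanalyze settings
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;

-- Estimated rows modified since each table was last analyzed
SELECT relname, n_mod_since_analyze, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;
```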
In that case we can run a manual `ANALYZE T1;`, which updates `T1`'s statistics. Running the same query again then produces a `Nested Loop`, which is better than the `Merge Join` here.
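Re-checking the plan with `EXPLAIN (ANALYZE)` shows both the estimated and the actual row counts, so you can confirm the fresh statistics brought the estimates back in line (a sketch; the plan you actually get depends on your data and configuration):

```sql
-- Compare the planner's row estimates with the actual row counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM t1
JOIN t2 ON t1.id = t2.id;
```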
Small conclusion: accurate table statistics give the optimizer accurate cost estimates, which help it choose the right execution plan.
Here is a script that shows the last time each table was analyzed (`last_analyze`) and vacuumed (`last_vacuum`).
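A sketch of such a script, using the standard `pg_stat_user_tables` view (which also records the autovacuum/autoanalyze timestamps):

```sql
-- Last (auto)vacuum and (auto)analyze time per table
SELECT schemaname,
       relname,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY schemaname, relname;
```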