Understanding execution plans and parallelization in PostgreSQL

I have a table that stores 1M 384-dimensional vectors in a column named vector of type "char"[]. I am trying to speed up the following query using parallelization:

EXPLAIN ANALYZE
WITH Vars(key) as (
    VALUES (array_fill(1, ARRAY[384])::vector)
)
SELECT content_id
FROM MyTable, Vars
ORDER BY vector::int[]::vector <#> key
LIMIT 10;

The key is just a toy vector consisting of all ones. <#> is the inner product operator of the pgvector extension (strictly speaking, it returns the negative inner product, so that ascending ORDER BY puts the best matches first), and vector is the type defined by that extension, which to my understanding is similar to real[].
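
For example, a quick sanity check of what the operator computes (assuming pgvector is installed):

SELECT '[1,2,3]'::vector <#> '[1,1,1]'::vector;
-- returns -6, i.e. -(1*1 + 2*1 + 3*1)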

I am running this query on the free tier of AWS RDS. The Postgres instance has two vCPUs, so I expect to see an improvement from using two workers. Given the high dimensionality of the vectors, I expect computing the dot products to dominate the execution time, and hence a speed-up of almost a factor of 2 with two workers.

To run without concurrency, I do:

set max_parallel_workers = 1;
set max_parallel_workers_per_gather = 1;
set enable_parallel_hash = off;

The output is:

 Limit  (cost=94267.29..94268.44 rows=10 width=12) (actual time=15624.961..15634.530 rows=10 loops=1)
   ->  Gather Merge  (cost=94267.29..161913.85 rows=588231 width=12) (actual time=15624.958..15634.524 rows=10 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=93267.28..94737.86 rows=588231 width=12) (actual time=15607.302..15607.305 rows=7 loops=2)
               Sort Key: ((((mytable.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on mytable  (cost=0.00..80555.82 rows=588231 width=12) (actual time=0.413..15452.274 rows=500000 loops=2)
 Planning Time: 10.502 ms
 Execution Time: 15635.121 ms
(11 rows)

Next, I force two workers:

set force_parallel_mode = on;
set max_parallel_workers = 2;
set max_parallel_workers_per_gather = 2;
set enable_parallel_hash = on;

Here is the output:

 Limit  (cost=83268.20..83269.37 rows=10 width=12) (actual time=14369.219..14379.656 rows=10 loops=1)
   ->  Gather Merge  (cost=83268.20..180496.59 rows=833328 width=12) (actual time=14369.217..14379.647 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=82268.18..83309.84 rows=416664 width=12) (actual time=14352.711..14352.714 rows=9 loops=3)
               Sort Key: ((((mytable.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on mytable  (cost=0.00..73264.22 rows=416664 width=12) (actual time=0.611..14204.459 rows=333333 loops=3)
 Planning Time: 7.062 ms
 Execution Time: 14380.487 ms
(12 rows)

The main question is: why is the expected time improvement not observed? Beyond that, I would really appreciate a walk-through of the EXPLAIN ANALYZE output. In particular:

  • The numbers of rows shown do not seem to reflect the actual number of rows in the table (which is exactly 1M).
  • Where can I see the time measurements/estimates related to computation of the dot products?
  • Why does the second output say (12 rows)? (I believe it should be 11: one line for the headers and ten for the data.)
  • And most importantly, what do the outputs of EXPLAIN ANALYZE reveal about the reasons for not getting the expected time improvement?
1 Answer

Laurenz Albe:

You have a basic misunderstanding: the parallel workers are started in addition to the backend process, which by default also does its share of the work. So if you set max_parallel_workers_per_gather = 1, you don't disable parallel query, but limit it to one additional process. Set the parameter to 0 to disable parallel query.
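
In terms of settings, a minimal sketch of the difference (same parameter as in the question):

-- truly serial: the backend alone scans all 1M rows
SET max_parallel_workers_per_gather = 0;

-- what the question's first run did: backend + 1 worker,
-- i.e. 2 cooperating processes (hence loops=2 in the plan)
SET max_parallel_workers_per_gather = 1;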

With this, it is easier to answer your questions:

  1. With two processes, each reads 500000 rows (the Parallel Seq Scan shows rows=500000 with loops=2), and with three processes each reads 333333 rows (rows=333333, loops=3); the actual row counts in EXPLAIN ANALYZE are averages over the loops.

    The estimated row counts for the sort and gather nodes show how many rows PostgreSQL would estimate without the LIMIT, while the actual row count stops short as soon as enough rows have been fetched to satisfy the LIMIT. The discrepancy is surprising until you know that. However, PostgreSQL still considers the LIMIT in its cost estimate.

  2. I assume that the “dot products” are function calls. You can see them if you use EXPLAIN (ANALYZE, VERBOSE): then you see the function call in the output columns of the node where the function is evaluated. The function execution time is not listed separately, but is included in the execution time of the node.

    Note that you can get function execution statistics from pg_stat_user_functions if you have enabled track_functions; see the sketch after this list.

  3. The twelve rows are the output of EXPLAIN, not the output of the query.

  4. You can see that the execution time of the parallel sequential scan does not become much shorter with a second parallel worker. Perhaps your disk is saturated. If you want to determine whether the time is spent on CPU or on I/O, enable track_io_timing (also shown in the sketch below).
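
A minimal sketch of both diagnostics from points 2 and 4. Both settings are superuser-only, so on RDS you may need to set them through the parameter group rather than with SET; the query is the one from the question with the CTE inlined:

-- 'all' tracks SQL and C functions, including the one backing
-- pgvector's <#> operator
SET track_functions = 'all';

-- record time spent in I/O; EXPLAIN (ANALYZE, BUFFERS) then adds
-- "I/O Timings" lines to nodes that read from disk
SET track_io_timing = on;

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT content_id
FROM MyTable
ORDER BY vector::int[]::vector <#> array_fill(1, ARRAY[384])::vector
LIMIT 10;

-- function statistics accumulated since track_functions was enabled
SELECT schemaname, funcname, calls, total_time, self_time
FROM pg_stat_user_functions
ORDER BY self_time DESC;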