Postgres Text Search extremely slow

52 views Asked by At

Using PostGres 15.5

My table looks like this:

CREATE TABLE IF NOT EXISTS directory_network_entity (
  id varchar(70) NOT NULL,
  country_code_2 varchar(2) NOT NULL,
  country_name varchar(50) NOT NULL,
  entity_name varchar(200) not NULL,
  source_name varchar(32) NOt NULL,
  source_category varchar(32) DEFAULT NULL,
  raw jsonb,
  identifiers_search_vectors tsvector default null,
  general_search_vectors tsvector default null,
  create_time_utc bigint not null default (extract(epoch from now()) * 1000),
  PRIMARY KEY (id)
);
CREATE INDEX gin_identifiers_search_vectors ON directory_network_entity USING GIN (identifiers_search_vectors);

CREATE INDEX gin_general_search_vectors ON directory_network_entity USING GIN (general_search_vectors);

I have about 180K rows (not a lot).

When I run this query:

SELECT (ts_rank_cd(general_search_vectors, query,  32) + ( 2 * ts_rank_cd(identifiers_search_vectors , query,  32))) as rank, 
id, entity_name FROM directory_network_entity , to_tsquery('0106/30253335`') query
WHERE query @@ identifiers_search_vectors order by rank desc limit 10;

The query takes 5s to execute (which is a lot).

The explain plan looks like this:

Limit  (cost=1499607.64..1499608.80 rows=10 width=94)
  ->  Gather Merge  (cost=1499607.64..1499695.14 rows=750 width=94)
        Workers Planned: 2
        ->  Sort  (cost=1498607.61..1498608.55 rows=375 width=94)
              Sort Key: ((ts_rank_cd(directory_network_entity.general_search_vectors, query.query, 32) + ('2'::double precision * ts_rank_cd(directory_network_entity.identifiers_search_vectors, query.query, 32)))) DESC
              ->  Nested Loop  (cost=0.25..1498599.51 rows=375 width=94)
                    Join Filter: (query.query @@ directory_network_entity.identifiers_search_vectors)
                    ->  Parallel Seq Scan on directory_network_entity  (cost=0.00..1496910.08 rows=74908 width=179)
                    ->  Function Scan on to_tsquery query  (cost=0.25..0.26 rows=1 width=32)

My use case is to store around 10M such records and now I'm really concerned whether PostGres is the right choice for this? Or is there anything am I missing in the query or indexing?

1

There are 1 answers

0
Mark Meeus On

As you can see in the visual profile of this plan here https://plansplainer.com/9yU2FI2mtQwuen8sVMp23 most of the cost is spent on the sequental scan of the directory_network_entity table.

A sequential scan means that it is reading every record of the table. That is not what you want, you want postgres to use your index so that it can directly read only the records that match your criteria.

It is not using the index is because you are joining on to_tsquery('0106/30253335') query. So pg reads every directory_network_entity, joins it with your query function, so that is has a row with directory_network_entity and the query. Only then it can filter these rows.

It looks like an optimization PostgreSQL should be able to do, but it can't. (yet)

It seems you can trigger the optimization by using a CTE for the query part.

like this:

explain with query as (
    select  to_tsquery('0106/30253335`') query
)
SELECT (ts_rank_cd(general_search_vectors, query,  32) + ( 2 * ts_rank_cd(identifiers_search_vectors , query,  32))) as rank, 
id, entity_name FROM directory_network_entity , query
WHERE query @@ identifiers_search_vectors order by rank desc limit 10;

on my Postgres 16.1 it shows this plan:

Limit  (cost=13.59..13.59 rows=1 width=584)
  Output: ((ts_rank_cd(directory_network_entity.general_search_vectors, to_tsquery('0106/30253335'::text), 32) + ('2'::double precision * ts_rank_cd(directory_network_entity.identifiers_search_vectors, to_tsquery('0106/30253335'::text), 32)))), directory_network_entity.id, directory_network_entity.entity_name
  ->  Sort  (cost=13.59..13.59 rows=1 width=584)
        Output: ((ts_rank_cd(directory_network_entity.general_search_vectors, to_tsquery('0106/30253335'::text), 32) + ('2'::double precision * ts_rank_cd(directory_network_entity.identifiers_search_vectors, to_tsquery('0106/30253335'::text), 32)))), directory_network_entity.id, directory_network_entity.entity_name
        Sort Key: ((ts_rank_cd(directory_network_entity.general_search_vectors, to_tsquery('0106/30253335'::text), 32) + ('2'::double precision * ts_rank_cd(directory_network_entity.identifiers_search_vectors, to_tsquery('0106/30253335'::text), 32)))) DESC
        ->  Bitmap Heap Scan on public.directory_network_entity  (cost=8.80..13.58 rows=1 width=584)
              Output: (ts_rank_cd(directory_network_entity.general_search_vectors, to_tsquery('0106/30253335'::text), 32) + ('2'::double precision * ts_rank_cd(directory_network_entity.identifiers_search_vectors, to_tsquery('0106/30253335'::text), 32))), directory_network_entity.id, directory_network_entity.entity_name
              Filter: (to_tsquery('0106/30253335'::text) @@ directory_network_entity.identifiers_search_vectors)
              ->  Bitmap Index Scan on gin_identifiers_search_vectors  (cost=0.00..8.80 rows=1 width=0)
                    Index Cond: (directory_network_entity.identifiers_search_vectors @@ to_tsquery('0106/30253335'::text))

Another way to achieve this is to use the same parameter in different places in the query. Also, the join on that to_tsquery is no longer needed.

Should give you the same plan as the query above. I would suggest to use this one, Postgres will know exactly what you want :-)

SELECT (ts_rank_cd(general_search_vectors, to_tsquery('0106/30253335'),  32) + ( 2 * ts_rank_cd(identifiers_search_vectors , to_tsquery('0106/30253335'),  32))) as rank, 
id, entity_name FROM directory_network_entity 
WHERE to_tsquery('0106/30253335') @@ identifiers_search_vectors order by rank desc limit 10;