Postgres trigram search is slow


I have a table with around 3 million rows. I created a single GIN index on multiple columns of the table:

CREATE INDEX search_idx ON customer USING gin (name gin_trgm_ops, id gin_trgm_ops, data gin_trgm_ops)
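
The pg_trgm extension, which provides gin_trgm_ops, is already enabled on the database:

CREATE EXTENSION IF NOT EXISTS pg_trgm;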

I am running the following query (simplified here to use a single column in the criteria), but it takes around 4 seconds:

EXPLAIN ANALYSE
SELECT c.id, similarity(c.name, 'john') sml
FROM customer c WHERE
c.name % 'john'
ORDER BY sml DESC
LIMIT 10

The output query plan is:

Limit (cost=9255.12..9255.14 rows=10 width=30) (actual time=3771.661..3771.665 rows=10 loops=1)
  -> Sort (cost=9255.12..9260.43 rows=2126 width=30) (actual time=3771.659..3771.661 rows=10 loops=1)
       Sort Key: (similarity((name)::text, 'john'::text)) DESC
       Sort Method: top-N heapsort Memory: 26kB
       -> Bitmap Heap Scan on customer c (cost=1140.48..9209.18 rows=2126 width=30) (actual time=140.665..3770.478 rows=3605 loops=1)
            Recheck Cond: ((name)::text % 'john'::text)
            Rows Removed by Index Recheck: 939598
            Heap Blocks: exact=158055 lossy=132577
            -> Bitmap Index Scan on search_idx (cost=0.00..1139.95 rows=2126 width=0) (actual time=105.609..105.610 rows=458131 loops=1)
                 Index Cond: ((name)::text % 'john'::text)
Planning Time: 0.102 ms

I fail to understand why the rows are not sorted and limited to 10 when they are retrieved from search_idx in the first step, so that only 10 rows (instead of 2126) would then need to be fetched from the customer table.

Any ideas how this query can be made faster? I tried a GiST index, but I see no performance gains. I also tried increasing work_mem from 4MB to 32MB, which gained about 1 second but no more. I also noticed that even if I remove c.id from the SELECT clause, Postgres does not perform an index-only scan and still visits the main table.
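
For the work_mem test I used a session-level setting before re-running the EXPLAIN, e.g.:

SET work_mem = '32MB';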

Thanks for the help.

Update 1: After Laurenz Albe's suggestion below, the query performance improved and is now around 600 ms. The plan looks like this now:

Subquery Scan on q  (cost=0.41..78.29 rows=1 width=12) (actual time=63.150..574.536 rows=10 loops=1)
  Filter: ((q.name)::text % 'john'::text)
  ->  Limit  (cost=0.41..78.16 rows=10 width=40) (actual time=63.148..574.518 rows=10 loops=1)
        ->  Index Scan using search_name_idx on customer c  (cost=0.41..2232864.76 rows=287182 width=40) (actual time=63.146..574.513 rows=10 loops=1)
              Order By: ((name)::text <-> 'john'::text)
Planning Time: 42.671 ms
Execution Time: 585.554 ms

There are 2 answers

Answer by Laurenz Albe:

To get the 10 closest matches with index support, you should create a GiST index and query like this:

SELECT id, sml
FROM (SELECT c.id,
             c.name,
             similarity(c.name, 'john') sml
      FROM customer c
      ORDER BY c.name <-> 'john'
      LIMIT 10) AS q
WHERE name % 'john';

The subquery can use the GiST index, and the outer query eliminates all results that do not exceed the pg_trgm.similarity_threshold.
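
For example, such an index on the name column (matching the search_name_idx that appears in the updated plan above) can be created like this:

CREATE INDEX search_name_idx ON customer USING gist (name gist_trgm_ops);

The threshold itself can be tuned per session if needed:

SET pg_trgm.similarity_threshold = 0.3;  -- 0.3 is the default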

Answer by jzqa:

UPDATE: Following https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search.html, I updated the query to use word_similarity with a multicolumn GiST index, and it improved performance a lot.

EXPLAIN (analyze, buffers)
WITH input AS (SELECT 'new york' AS search)
SELECT *
FROM customer, input
WHERE
(
    input.search <% id
    OR input.search <% name
    OR input.search <% city
)
ORDER BY
input.search <<-> id,
input.search <<-> name,
input.search <<-> city
LIMIT 10

The explain plan shows an index-only scan.
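
For reference, a matching multicolumn GiST index would look something like this (the index name here is illustrative; the exact definition is an assumption, since it is not shown above):

CREATE INDEX customer_search_gist_idx ON customer
    USING gist (id gist_trgm_ops, name gist_trgm_ops, city gist_trgm_ops);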