How to use ts_headline() in PostgreSQL while doing efficient full-text search? Comparing two query plans

1.1k views Asked by At

I am experimenting with a full-text search system over my PostgreSQL database, where I am using tsvectors with ts_rank() to pull out relevant items to a user search query. In general this works really fantastic as a simple solution (i.e. no major overhead infrastructure). However, I am finding that the ts_headline() component (which gives users context for the relevance of the search results) is slowing down my queries significantly, by an order of about 10x. I wanted to inquire what is the best way to use ts_headline() without incurring computational expense.

To give an example, here is a very fast tsvector search that does not use ts_headline(). For context, my table has two relevant fields, search_text which has the natural-language text which is being searched against, and search_text_tsv which is a tsvector that is directly queried against (and also used to rank the item). When I use ts_headline(), it references the main search_text field in order to produce a user-readable headline. Further, the column search_text_tsv is indexed using GIN, which provides very fast lookups for @@ websearch_to_tsquery('my query here').

Again, here is query #1:

SELECT
    item_id,
    title,
    author,
    search_text,
    ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20

This gives me 20 top results very fast, running on my laptop about 50ms.

Now, query #2 uses ts_headline() to produce a user-readable headline. I found that this was very slow when it ran against all possible search results, so I used a sub-query to produce the top 20 results and then calculated the ts_headline() only for those top results (as opposed to, say, 1000 possible results).

SELECT *,
    ts_headline(search_text,websearch_to_tsquery(unaccent('my query text here')),'StartSel=<b>,StopSel=</b>,MaxFragments=2,' || 'FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM (
SELECT
    item_id,
    title,
    author,
    search_text,
    ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20) as foo

Basically, what this does is limits the # of results (as in the first query), and then uses that as a sub-query, returning all of the columns in the subquery (i.e. *) and also the ts_headline() calculation. However, this is very slow, by an order of magnitude of about 10, coming in at around 800ms on my laptop.

Is there anything I can do to speed up ts_headline()? It seems pretty clear that this is what is slowing down the second query.

For reference, here are the query plans being produced by Postgresql (from EXPLAIN ANALYZE):

Query plan 1: (straight full-text search)

Limit  (cost=56.79..56.79 rows=1 width=270) (actual time=66.118..66.125 rows=20 loops=1)
   ->  Sort  (cost=56.78..56.79 rows=1 width=270) (actual time=66.113..66.120 rows=40 loops=1)
         Sort Key: (ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Bitmap Heap Scan on search_index  (cost=52.25..56.77 rows=1 width=270) (actual time=1.070..65.641 rows=462 loops=1)
               Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
               Heap Blocks: exact=424
               ->  Bitmap Index Scan on idx_fts_search  (cost=0.00..52.25 rows=1 width=0) (actual time=0.966..0.966 rows=462 loops=1)
                     Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
 Planning Time: 0.182 ms
 Execution Time: 66.154 ms

Query plan 2: (full text search w/ subquery & ts_headline())

 Subquery Scan on foo  (cost=56.79..57.31 rows=1 width=302) (actual time=116.424..881.617 rows=20 loops=1)
   ->  Limit  (cost=56.79..56.79 rows=1 width=270) (actual time=62.470..62.497 rows=20 loops=1)
         ->  Sort  (cost=56.78..56.79 rows=1 width=270) (actual time=62.466..62.484 rows=40 loops=1)
               Sort Key: (ts_rank(search_index.search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
               Sort Method: top-N heapsort  Memory: 34kB
               ->  Bitmap Heap Scan on search_index  (cost=52.25..56.77 rows=1 width=270) (actual time=2.378..62.151 rows=462 loops=1)
                     Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
                     Heap Blocks: exact=424
                     ->  Bitmap Index Scan on idx_fts_search  (cost=0.00..52.25 rows=1 width=0) (actual time=2.154..2.154 rows=462 loops=1)
                           Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
 Planning Time: 0.350 ms
 Execution Time: 881.702 ms

3

There are 3 answers

0
perelin On

Just encountered exactly the same issue. When collecting a list of search results (20-30 documents) and also getting their ts_headline highlight in the same query the execution time was x10 at least.

To be fair, Postgres documentation is warning about that [1]:

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.

I ended up getting the list of documents first and then loading the highlights with ts_headline asynchronously one-by-one. Still slow single queries (>150ms) but better user experience then waiting multiple seconds for an initial load.

[1] https://www.postgresql.org/docs/15/textsearch-controls.html#TEXTSEARCH-HEADLINE

1
Dan K. On

I think I can buy you a few more milliseconds. In your query, you're returning "SELECT *, ts_headline" which includes the full original document search_text in the return. When I limited my SELECT to everything but the "search_text" from the subquery (+ ts_headline as headline), my queries dropped from 500-800ms to 100-400ms. I'm also using AWS RDS so that might play a role on my end.

0
Jeff Vermeer On

I have encountered this issue in a few different contexts, and, while working on a method for improving the highlighting pf phrases and the non-highlighting of partial matches, derived a method for doing highlighted content retrieval ~10X faster than ts_headline. The method requires one have a pre-realized TSVector lookup column AND a pre-processed TEXT[] of words in a document, but if you are willing to pay that price, the speed of delivery is remarkably improved in early testing.

See: https://github.com/thevermeer/pg_ts_semantic_headline?tab=readme-ov-file#how-to-implement-content-retrieval-that-is-5x-10x-faster-than-ts_headline