Can this PostgreSQL query be further optimized?


We're building a translation editor, and one of the main use cases is finding similar translations in the database. The main entities are segment, translation_record and user. A segment can be either a source or a target segment (text), with translation_record being the connecting entity.

For similarity we use pg_trgm.

We have these indices implemented:

    CREATE INDEX IF NOT EXISTS segment_content_gin ON segments USING gin (content gin_trgm_ops);
    CREATE INDEX IF NOT EXISTS segment_language_id_idx ON segments USING btree (language_id);
    CREATE INDEX IF NOT EXISTS translation_record_language_combination_idx ON translation_records USING btree (language_combination);

This is the query we use (note the Ruby-style interpolated values):

    SET pg_trgm.similarity_threshold TO 0.#{sim_score};

    SELECT SIMILARITY(segments.content, '#{source_for_lookup}') AS similarity,
           translation_records.id AS translation_record_id,
           translation_records.source_segment_id AS source_segment_id,
           segments.content AS source_segment_content,
           translation_records.target_segment_id AS target_segment_id,
           target_segments.content AS target_segment_content,
           creators.username AS created_by_username,
           updaters.username AS updated_by_username,
           translation_records.created_at,
           translation_records.updated_at,
           translation_records.project_name,
           translation_records.import_comment,
           translation_records.style_id,
           translation_records.domain_id,
           segments.language_id AS source_language_id,
           target_segments.language_id AS target_language_id
    FROM segments
    JOIN translation_records
      ON segments.id = translation_records.source_segment_id
    JOIN segments AS target_segments
      ON translation_records.target_segment_id = target_segments.id
    JOIN users AS creators
      ON translation_records.created_by = creators.id
    LEFT JOIN users AS updaters
      ON translation_records.updated_by = updaters.id
    WHERE segments.content % '#{source_for_lookup}'
      AND translation_records.language_combination = '#{lang_lookup_combo}'
    ORDER BY SIMILARITY(segments.content, '#{source_for_lookup}') DESC
    LIMIT #{max_results};
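As an aside, because the Ruby values are interpolated straight into the SQL text, the same lookup can also be expressed with bind parameters, e.g. via a prepared statement, which avoids SQL injection and repeated planning. A minimal sketch of the idea (the statement name is an assumption, the full column list and joins from the query above are elided; `SET` does not accept bind parameters, so the threshold still has to be set separately):

```sql
-- Sketch: bind parameters instead of string interpolation.
-- $1 = search text, $2 = language combination, $3 = max results.
PREPARE similar_segments (text, text, int) AS
    SELECT similarity(segments.content, $1) AS similarity,
           translation_records.id AS translation_record_id
           -- (remaining columns and joins from the original query go here)
    FROM segments
    JOIN translation_records
      ON segments.id = translation_records.source_segment_id
    WHERE segments.content % $1
      AND translation_records.language_combination = $2
    ORDER BY similarity(segments.content, $1) DESC
    LIMIT $3;

EXECUTE similar_segments('Coop Himmelb(l)au, Vienna, Austria', '2_1', 5);
```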

The execution time on my dev laptop, on a table of 4.7 M segments, is around 400 ms. My question: can I further optimize this query by structuring the joins and the WHERE clause differently, or by making any other changes?


EDIT: explain (buffers, analyze) with ordering by similarity

"Limit  (cost=59749.56..59750.14 rows=5 width=356) (actual time=458.808..462.364 rows=2 loops=1)"
"  Buffers: shared hit=15821 read=37693"
"  I/O Timings: read=58.698"
"  ->  Gather Merge  (cost=59749.56..59774.99 rows=218 width=356) (actual time=458.806..462.360 rows=2 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        Buffers: shared hit=15821 read=37693"
"        I/O Timings: read=58.698"
"        ->  Sort  (cost=58749.53..58749.81 rows=109 width=356) (actual time=434.602..434.606 rows=1 loops=3)"
"              Sort Key: (similarity(segments.content, 'Coop Himmelb(l)au, Vienna, Austria'::text)) DESC"
"              Sort Method: quicksort  Memory: 25kB"
"              Buffers: shared hit=15821 read=37693"
"              I/O Timings: read=58.698"
"              Worker 0:  Sort Method: quicksort  Memory: 25kB"
"              Worker 1:  Sort Method: quicksort  Memory: 25kB"
"              ->  Hash Left Join  (cost=4326.38..58747.72 rows=109 width=356) (actual time=433.628..434.588 rows=1 loops=3)"
"                    Hash Cond: (translation_records.updated_by = updaters.id)"
"                    Buffers: shared hit=15805 read=37693"
"                    I/O Timings: read=58.698"
"                    ->  Nested Loop  (cost=4309.86..58730.64 rows=109 width=324) (actual time=433.603..434.562 rows=1 loops=3)"
"                          Buffers: shared hit=15803 read=37693"
"                          I/O Timings: read=58.698"
"                          ->  Nested Loop  (cost=4309.70..58727.69 rows=109 width=296) (actual time=433.593..434.551 rows=1 loops=3)"
"                                Buffers: shared hit=15798 read=37693"
"                                I/O Timings: read=58.698"
"                                ->  Hash Join  (cost=4309.27..58658.80 rows=109 width=174) (actual time=433.578..434.535 rows=1 loops=3)"
"                                      Hash Cond: (translation_records.source_segment_id = segments.id)"
"                                      Buffers: shared hit=15789 read=37693"
"                                      I/O Timings: read=58.698"
"                                      ->  Parallel Seq Scan on translation_records  (cost=0.00..51497.78 rows=1086382 width=52) (actual time=0.024..145.197 rows=869773 loops=3)"
"                                            Filter: (language_combination = '2_1'::text)"
"                                            Buffers: shared hit=225 read=37693"
"                                            I/O Timings: read=58.698"
"                                      ->  Hash  (cost=4303.61..4303.61 rows=453 width=126) (actual time=229.792..229.793 rows=2 loops=3)"
"                                            Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                                            Buffers: shared hit=15558"
"                                            ->  Bitmap Heap Scan on segments  (cost=2575.51..4303.61 rows=453 width=126) (actual time=225.687..229.789 rows=2 loops=3)"
"                                                  Recheck Cond: (content % 'Coop Himmelb(l)au, Vienna, Austria'::text)"
"                                                  Rows Removed by Index Recheck: 63"
"                                                  Heap Blocks: exact=60"
"                                                  Buffers: shared hit=15558"
"                                                  ->  Bitmap Index Scan on segment_content_gin  (cost=0.00..2575.40 rows=453 width=0) (actual time=225.653..225.653 rows=65 loops=3)"
"                                                        Index Cond: (content % 'Coop Himmelb(l)au, Vienna, Austria'::text)"
"                                                        Buffers: shared hit=15378"
"                                ->  Index Scan using segments_pkey on segments target_segments  (cost=0.43..0.63 rows=1 width=126) (actual time=0.019..0.019 rows=1 loops=2)"
"                                      Index Cond: (id = translation_records.target_segment_id)"
"                                      Buffers: shared hit=9"
"                          ->  Memoize  (cost=0.16..0.18 rows=1 width=36) (actual time=0.012..0.013 rows=1 loops=2)"
"                                Cache Key: translation_records.created_by"
"                                Cache Mode: logical"
"                                Hits: 0  Misses: 1  Evictions: 0  Overflows: 0  Memory Usage: 1kB"
"                                Buffers: shared hit=5"
"                                Worker 0:  Hits: 0  Misses: 1  Evictions: 0  Overflows: 0  Memory Usage: 1kB"
"                                ->  Index Scan using users_pkey on users creators  (cost=0.15..0.17 rows=1 width=36) (actual time=0.010..0.010 rows=1 loops=2)"
"                                      Index Cond: (id = translation_records.created_by)"
"                                      Buffers: shared hit=5"
"                    ->  Hash  (cost=12.90..12.90 rows=290 width=36) (actual time=0.014..0.014 rows=12 loops=2)"
"                          Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                          Buffers: shared hit=2"
"                          ->  Seq Scan on users updaters  (cost=0.00..12.90 rows=290 width=36) (actual time=0.010..0.011 rows=12 loops=2)"
"                                Buffers: shared hit=2"
"Planning:"
"  Buffers: shared hit=28"
"Planning Time: 5.739 ms"
"Execution Time: 462.490 ms"

END EDIT

EDIT: explain (buffers, analyze) without ordering

When I remove the ORDER BY line, the query actually slows down. Also, I went with a GIN index because of the % operator.

"Limit  (cost=4310.00..5796.68 rows=5 width=356) (actual time=777.107..780.931 rows=2 loops=1)"
"  Buffers: shared hit=5519 read=37597"
"  I/O Timings: read=55.820"
"  ->  Nested Loop Left Join  (cost=4310.00..81914.70 rows=261 width=356) (actual time=777.105..780.929 rows=2 loops=1)"
"        Buffers: shared hit=5519 read=37597"
"        I/O Timings: read=55.820"
"        ->  Nested Loop  (cost=4309.85..81870.96 rows=261 width=324) (actual time=777.085..780.900 rows=2 loops=1)"
"              Buffers: shared hit=5519 read=37597"
"              I/O Timings: read=55.820"
"              ->  Nested Loop  (cost=4309.70..81827.91 rows=261 width=296) (actual time=777.080..780.892 rows=2 loops=1)"
"                    Buffers: shared hit=5515 read=37597"
"                    I/O Timings: read=55.820"
"                    ->  Hash Join  (cost=4309.27..81662.97 rows=261 width=174) (actual time=777.062..780.869 rows=2 loops=1)"
"                          Hash Cond: (translation_records.source_segment_id = segments.id)"
"                          Buffers: shared hit=5507 read=37597"
"                          I/O Timings: read=55.820"
"                          ->  Seq Scan on translation_records  (cost=0.00..70509.48 rows=2607318 width=52) (actual time=0.019..387.974 rows=2609320 loops=1)"
"                                Filter: (language_combination = '2_1'::text)"
"                                Buffers: shared hit=321 read=37597"
"                                I/O Timings: read=55.820"
"                          ->  Hash  (cost=4303.61..4303.61 rows=453 width=126) (actual time=229.363..229.364 rows=2 loops=1)"
"                                Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                                Buffers: shared hit=5186"
"                                ->  Bitmap Heap Scan on segments  (cost=2575.51..4303.61 rows=453 width=126) (actual time=225.850..229.358 rows=2 loops=1)"
"                                      Recheck Cond: (content % 'Coop Himmelb(l)au, Vienna, Austria'::text)"
"                                      Rows Removed by Index Recheck: 63"
"                                      Heap Blocks: exact=60"
"                                      Buffers: shared hit=5186"
"                                      ->  Bitmap Index Scan on segment_content_gin  (cost=0.00..2575.40 rows=453 width=0) (actual time=225.817..225.817 rows=65 loops=1)"
"                                            Index Cond: (content % 'Coop Himmelb(l)au, Vienna, Austria'::text)"
"                                            Buffers: shared hit=5126"
"                    ->  Index Scan using segments_pkey on segments target_segments  (cost=0.43..0.63 rows=1 width=126) (actual time=0.008..0.008 rows=1 loops=2)"
"                          Index Cond: (id = translation_records.target_segment_id)"
"                          Buffers: shared hit=8"
"              ->  Index Scan using users_pkey on users creators  (cost=0.15..0.17 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=2)"
"                    Index Cond: (id = translation_records.created_by)"
"                    Buffers: shared hit=4"
"        ->  Index Scan using users_pkey on users updaters  (cost=0.15..0.17 rows=1 width=36) (actual time=0.000..0.000 rows=0 loops=2)"
"              Index Cond: (id = translation_records.updated_by)"
"Planning:"
"  Buffers: shared hit=28"
"Planning Time: 4.569 ms"
"Execution Time: 781.066 ms"

EDIT 2

Based on the first answer and all the comments, I created two additional btree indices: one on translation_records.source_segment_id and one on translation_records.target_segment_id. I also switched to a GiST index for segments.content and used the <-> operator for ordering. The query above actually slowed down, to 4.5 seconds.

It should be noted that the searched text sits at row 1908 in the DB, and the similarity threshold is 0.45. The same query takes about 4.6 seconds for the last row in the DB. The ordering operator doesn't have any effect.

With the GIN index it went back to 407 ms, regardless of the ordering operator, but it takes about 6 seconds for the last segment in the DB.

What I had overlooked before is that the similarity threshold has a huge impact: raising it to 0.55 drops the time from 6 seconds to 1.2 seconds for the last row in the DB on my dev laptop.
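For reference, the two additional indices from this edit would look roughly like this (the index names are assumptions):

```sql
-- B-tree indices on the two join columns of translation_records
CREATE INDEX IF NOT EXISTS translation_record_source_segment_id_idx
    ON translation_records USING btree (source_segment_id);
CREATE INDEX IF NOT EXISTS translation_record_target_segment_id_idx
    ON translation_records USING btree (target_segment_id);
```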

END EDIT 2

Best, Seba

1 Answer (by JGH):

Because you are ordering by similarity, a GiST index should be used.

Then, in the ORDER BY, use the distance operator (<->) instead of the SIMILARITY function, as the former makes use of the GiST index.
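Sketched out, that suggestion would look like this (the index name is an assumption; note that <-> is a distance, i.e. 1 - similarity, so the ORDER BY is ascending, which is the default):

```sql
-- GiST trigram index, supporting both the % filter and <-> ordering
CREATE INDEX IF NOT EXISTS segment_content_gist
    ON segments USING gist (content gist_trgm_ops);

-- Smaller distance = more similar, so ascending order returns best matches first
SELECT segments.content,
       segments.content <-> '#{source_for_lookup}' AS distance
FROM segments
WHERE segments.content % '#{source_for_lookup}'
ORDER BY segments.content <-> '#{source_for_lookup}'
LIMIT #{max_results};
```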