Fast string search in PostgreSQL table with slightly erroneous input

45 views Asked by At

Assume I have a user table in PostgreSQL with columns

first_name (PK)
last_name (PK)
email

There are now millions of users in it. One user has the record

(John, Smith, [email protected])

Now I search for him and enter erroneously Johny Smit.

How can I still find the record and this really quick? Is that possible with sqlalchemy as well?

1

There are 1 answers

5
Zegarek On BEST ANSWER

You can use trigram-based indexing and search included in the pg_trgm extension: demo

create extension pg_trgm;
create index trgm_idx on my_table using GiST ( first_name gist_trgm_ops
                                              ,last_name  gist_trgm_ops);
select * from my_table 
where    first_name % 'Johny' 
  and    last_name  % 'Smit' 
order by last_name <->'Smit'
        ,first_name<->'Johny'
limit 5;
first_name last_name email
John Smith [email protected]
QUERY PLAN
Limit (cost=0.28..8.31 rows=1 width=119) (actual time=19.341..19.353 rows=1 loops=1)
  Output: first_name, last_name, email, ((first_name <-> 'Johny'::text)), ((last_name <-> 'Smit'::text))
  -> Index Scan using trgm_idx on public.my_table (cost=0.28..8.31 rows=1 width=119) (actual time=19.337..19.348 rows=1 loops=1)
        Output: first_name, last_name, email, (first_name <-> 'Johny'::text), (last_name <-> 'Smit'::text)
        Index Cond: ((my_table.first_name % 'Johny'::text) AND (my_table.last_name % 'Smit'::text))
        Order By: ((my_table.first_name <-> 'Johny'::text) AND (my_table.last_name <-> 'Smit'::text))
Planning Time: 1.364 ms
Execution Time: 19.416 ms
  1. In a real-life scenario it won't be as fast as this because I just buried John Smith under a pile of 100k random uuids.
  2. You will have to tweak your similarity target and still probably search for a number of best matches and somehow pick out of those - there could be many people with names that also match your search.
  3. You might want use an expression index that merges the two fields so that you can deal with just one search phrase and one resulting similarity metric, instead of having to resolve two.
drop index trgm_idx ;
create index trgm_idx2 on my_table 
   using GiST ((first_name||' '||last_name) gist_trgm_ops);
prepare find_john_smith_using_pg_trgm(text) as 
select *,$1 as search_phrase
        , (first_name||' '||last_name)<->  $1 as "<->"
        , (first_name||' '||last_name)<<-> $1 as "<<->"
        , (first_name||' '||last_name)<<<->$1 as "<<<->"
        , word_similarity(first_name||' '||last_name,$1)
        , strict_word_similarity(first_name||' '||last_name,$1)
from my_table 
where    (first_name||' '||last_name) % $1
order by (first_name||' '||last_name)<->$1
limit 5;

execute find_john_smith_using_pg_trgm('Johny Smit');
first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
John Smith [email protected] Johny Smit 0.4285714 0.38461536 0.4285714 0.61538464 0.5714286
Johan Smittson [email protected] Johny Smit 0.6315789 0.6111111 0.6315789 0.3888889 0.36842105
execute find_john_smith_using_pg_trgm('Johny');
first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
John Smith [email protected] Johny 0.6923077 0.6363636 0.6923077 0.36363637 0.30769232
execute find_john_smith_using_pg_trgm('Smitt');
first_name last_name email search_phrase <-> <<-> <<<-> word_similarity strict_word_similarity
Johan Smittson [email protected] Smitt 0.6875 0.6666666 0.6875 0.33333334 0.3125
John Smith [email protected] Smitt 0.6923077 0.6363636 0.6923077 0.36363637 0.30769232
  1. If you do keep them separate, you can establish priorities, and for example order by a weighted average of the two - it's safe to assume the match on last_name has some priority over the match on first_name, especially since last names aren't altered as often (diminutives, abbreviation, nicknames, second names, etc.).