Fulltext search combined with fuzzysearch in PostgreSQL

8.2k views Asked by At

I want to realize a fulltext search in postgresql combined with a kind of fuzzy search. For my testarea I followed up this article: https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/ and everything is working fine. But sometimes I have search cases with and without spaces in the searach string like this:

In my 'title'-column there is an entry like 'test123'. My searchstring looks like 'test 123' with a space in it. How can I get a hit in this testcase?

My search-sql-query looks like:

SELECT * 
FROM test, plainto_tsquery('test:*&123:*') as q 
WHERE (tsv @@ q)

result: 0 rows

So I tried to figure out if I can use pg_trgm combined with ts_vector but I can not find a solution. Do you have an idea?

1

There are 1 answers

1
Mani Gandham On

As the documentation on parsing states:

...plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input...

plainto_tsquery and phraseto_tsquery are convenience functions which make it easier to search by a full string, but they don't support all of the features. Use to_tsquery instead which accepts the full search syntax:

SELECT * 
FROM test, to_tsquery('testing:* & 123:*') as q 
WHERE (tsv @@ q)

This function also requires you to normalize the search query in the same way you normalize the text you're searching by using to_tsvector, but that's pretty easy with some string functions:

SELECT string_agg(lexeme || ':*', ' & ' order by positions) 
FROM unnest(to_tsvector('testing 123'))

This basically gets the individual tokens from to_tsvector, appends :* to each, then joins them with & to create a single string. The example above takes testing 123 and produces testing:* & 123:* which you can then use directly with to_tsquery to get fuzzy matching with the normalization intact.

You can combine it all together into a CTE to make it simple:

WITH search AS (
    SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' order by positions)) AS query
    FROM unnest(to_tsvector('enter your search query here'))
)
SELECT test.*
FROM test, search
WHERE (test.tsv @@ search.query)

This assumes that the table has a tsv column of datatype tsquery which is pregenerated instead of creating it on every query (which is much slower). PG12+ supports generated columns which can keep this updated automatically.