I've been enlisted to help on a project and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into using tsvector fields before and now find myself facing a bug based on them. I read the documentation on the field type and it's purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery()
Example
> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column?
----------
f
(1 row)
> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column?
----------
t
(1 row)
I would think they should both return true, but obviously the first does not - why?
The FTS utilizes dictionaries to normalize the text:
So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
Let us look at some output from
ts_debug
and see what's going on with the dictionaries:Notice that
simple
uses thesimple
dictionary whereasenglish
uses theenglish_stem
dictionary.The
simple
dictionary:The
simple
dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:The
simple
dictionary is too simple to even handle simple plurals.So what is this
english_stem
dictionary all about? The "stem" suffix is a give away: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:And just below that we see the
english_stem
dictionary:So the
english_stem
dictionary stems words and we can see that happen:Executive Summary:
'simple'
implies simple minded literal matching,'english'
applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag and that gives you your match.