This is a problem that comes up often in my work with various data sets, so please excuse me for presenting it in general terms rather than using a specific example.
I often need to get records from a large table (generally 100s of millions of rows) where a text column is similar to a column in a much smaller table (10s to 100s of rows). My current approach is as follows, where targets is the smaller table and matches is the larger one.
set pg_trgm.similarity_threshold = .9;
select *
from targets as t
inner join matches as m on
t.name % m.name;
matches.name will have a GIN index and will generally have a relatively high degree of uniqueness, with perhaps 10-20% of records being duplicates. Both matches.name and targets.name are almost always less than 50 characters long, and often much shorter.
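For reference, the index is created along these lines (the index name here is just for illustration):

create extension if not exists pg_trgm;
create index matches_name_trgm_idx on matches using gin (name gin_trgm_ops);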
As I understand it, this is a slightly unusual use-case: the Postgres documentation and most SO answers seem to focus on optimising for matching against a single value. So I'd be keen to hear thoughts on two questions:
- In very general terms (tens of minutes, hours, etc.), and assuming the database is configured optimally, what's a reasonable performance target for this type of query, given, say, 300 targets and 300 million potential matches?
- Is the strategy I'm using at the moment the most efficient one given the parameters? Would it be worth trying a GiST index and taking the top n matches for each row using the <-> operator instead, for example (a rough sketch of what I mean is below the list)? Are there completely different approaches that could be more efficient?
Thanks in advance for your help!
There is no bonus for bulk operations of this nature. The documentation and existing answers don't say anything about doing it more than once because there is not much to say: doing it 300 times (once per row in t) is about 300 times as expensive as doing it for one row in t.
This will depend on the histogram of trigram frequencies, so it can make a big difference whether these are street addresses, English phrases, serial numbers/part numbers, or something else. As a rough estimate, I would say that at a threshold of 0.9 (it gets much worse as that decreases) you are looking at 30 seconds to a minute per row in t.
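If you want a rough feel for that histogram, counting trigrams over a small sample of the big table would show it (table and column names are from your question; the 1% sample size and the limit are arbitrary):

select trgm, count(*) as freq
from matches tablesample system (1),
     unnest(show_trgm(name)) as trgm
group by trgm
order by freq desc
limit 20;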
I would expect a substantial degradation from using GiST rather than GIN.
A more efficient way would be to hand-code something in C that doesn't have to deal with transactions, mutability, concurrency, abstract data types, etc. There are also some improvements that could probably be made if we had statistical estimates of the frequency of each trigram in the giant table, but I don't think that would be very feasible for an extension within the current PostgreSQL infrastructure.