PostgreSQL multiple pg_trgm similarity score sub-query

1.1k views Asked by At

I'm fairly new to SQL and I'm struggling with a sub query. I've got a table that looks like this:

      sss        | mm | sid 
------------------+----+-----
IBM LTD          |    | 003
I.B.M.           |    | 003
A.BM LTD         |    | 004
IMB LTD          |    | 004
IMB UK           |    | 005
IBMUK LTD        |    | 006
IBMUKLTD         |    | 007
IBM LIMITED      |    | 008
IBM U.K.         |    | 008
IBM U.K. LIMITED |    | 009
I.B.M UK LTD     | 1  | 001
IBM              | 1  | 001
IBM UK           | 1  | 001
IBM UK LTD       | 1  | 001

The rows with a 1 in the mm column are rows that have been manually matched and the ID is correct. I want to use the similarity function to try and find close matches to the records that have already been matched manually.

I know that normally you put a sub-query in the from statement but I can't do that with the similarity function, but I'm not sure how to do it, this is what I'd like to do:

SELECT 
    sss, 
    similarity(sss, select(sss from tusm where mm = 1) as match_score)
from 
    tusm 
where 
    mm is null and
    where match_score >= 0.5
;
1

There are 1 answers

1
jjanes On BEST ANSWER

You have a few problems here. You can't use column aliases in the where clause, you are trying to assign a column alias on the wrong side of the parenthesis, you can't feed a set to the second argument of similarity, and you've just generally mangled the syntax in several places.

To overcome the inability of similarity to deal with sets, do a self join:

select a.sss as a_sss, b.sss as b_sss, similarity(a.sss, b.sss) as score from
tusm a, tusm b 
where a.mm is null and b.mm = 1 and 
similarity(a.sss, b.sss)  > 0.5;

giving

   a_sss   |   b_sss    |  score
-----------+------------+----------
 IBM LTD   | IBM UK LTD | 0.727273
 IBMUK LTD | IBM UK LTD | 0.615385

If the table gets large, it will be hard to make this perform well.