I got records from two different sources and the goal is to create link between sources that match each other. For that purpose there is applied trained AI model which assigns score of probability of match for each record from source A to each record to source B. The table score
then looks like this.
src_a_id src_b_id score
-----------------------------
1 foo 0.8
1 bar 0.7
1 baz 0.6
2 foo 0.9
2 bar 0.5
2 baz 0.3
Now I need to read from this table what is the most likely counter match to src_a
record with id 1
. When you select data with sql SELECT * FROM score WHERE src_a_id = 1 ORDER BY score DESC;
you will get this result.
src_a_id src_b_id score
-----------------------------
1 foo 0.8
1 bar 0.7
1 baz 0.6
Here it looks like the first row is result I am looking for and so the counter match is src_b
record with id foo
with mutual score 0.8
but it is not correct. We can query from the other side to verify result. What is counter match to src_b
with id foo
? Using sql SELECT * FROM score WHERE src_b_id = 'foo' ORDER BY score DESC;
we get result:
src_a_id src_b_id score
-----------------------------
2 foo 0.9
1 foo 0.8
From the first query it has looked like src_a
id 1
matches src_b
id foo
.
From the second query it is clear that previous conclusion is wrong because src_b
id foo
matches to src_a
id 2
because this pair has higher mutual score.
How should I write query to find the match to src_a
record with id 1
considering that table will have thousands hundred records?
My first steps were searching for some recursive queries in Postgres but tutorials I have found didn't fit to my use case and honestly I am quite failing make up any working application so far.
EDIT
For the demonstration syntax of the creating testing data:
CREATE TABLE score (
src_a_id integer NOT NULL,
src_b_id varchar(255) NOT NULL,
score decimal(3,2) NOT NULL
);
INSERT INTO score (src_a_id, src_b_id, score)
VALUES
(1, 'foo', 0.8),
(1, 'bar', 0.7),
(1, 'baz', 0.6),
(2, 'foo', 0.9),
(2, 'bar', 0.5),
(2, 'baz', 0.3);
From the testing data it can be derived there exist two pairs.
1
matchesbar
2
matchesfoo
baz
doesn't have match
How can I query for src_a id 1
match? Expected result is src_b id bar
. And from the other side. How can I query for src_b id bar
match? Expected result is src_a id 1
.
It seems likely that your problem can be solved w/o recursion by using a window function
row_number() over(<partition>)
. What you want is to find such pairs where the score is maximal for each id individually.Given the sample dataset you provided - we can write this CTE where we have 2 row numbers (one for each id) and then sum them to get the rank of a pair:
With that you'd get this result:
And now you can pick pairs where the
pair_rank
is minimalGiven no the query will result in all pairs where the score is the highest
EDIT: There are several edge cases where the above approach yields ambiguous/incorrect result:
src_A_id
have lower score then any other pair sharing the samesrc_B_id
(should the query return null/0 rows/highest amongst allsrc_A_id
?)src_B_id
(which one wins over the other given the src_A_id are different?)src_A_id
give the highest score for the samesrc_B_id
(again, which one wins over the other given both src_A_id and src_B_id are the same?)Given the below dataset, you can observe all 3 cases:
And here is the adapted script, but you can adjust according to the desired behavior:
Which yield:
null
if all pairs are taken (examplesrc_a_id
= 1)src_b_id
(examplesrc_a_id
= 2)src_a_id
(examplesrc_a_id
= 3)