De-duplicate part of a table on given condition

63 views Asked by At

This is my message table. Now I am trying to write an query to group records by collection_id (only first instance) if type multi-store else by message.id .

                  id                  |            collection_id             |      type      | affiliation_id |    status    | scheduled_for_date 
--------------------------------------+--------------------------------------+----------------+----------------+--------------+--------------------
 1143c066-01ed-4eb5-a146-68487de702a9 | bce85e31-4d2f-43b1-b263-57fca356856f | multi-store    | 12091          | draft        | 
 e1183732-e91d-42eb-9998-110e039cfc25 | bce85e31-4d2f-43b1-b263-57fca356856f | multi-store    | 12092          | draft        | 
 8f962a49-7da1-46d0-87b9-595788767dfe | bce85e31-4d2f-43b1-b263-57fca356856f | multi-store    | 12097          | draft        | 
 6e4dee09-7a4e-47be-bdd8-935a67bb2063 | 740f6b42-bbf1-4aeb-8fe9-6874635d9e29 | multi-store    | 12091          | draft        | 
 79afab0e-14e7-4d1b-9a15-358763743c3e | 740f6b42-bbf1-4aeb-8fe9-6874635d9e29 | multi-store    | 12092          | draft        | 
 7bc78bee-074a-4031-9492-954e7c4eeb09 | 740f6b42-bbf1-4aeb-8fe9-6874635d9e29 | multi-store    | 12097          | draft        | 
 3bb38fbd-d411-4f78-9c42-c858bf57b784 |                                      | standard-store | 10511          | draft        | 
 fbb3b175-1a3b-4515-b0b3-0ce6d6d0145f |                                      | standard-store | 10511          | draft        | 
 84004999-d2cf-4af4-bfaa-c1077d1d8621 |                                      | standard-store | 10511          | sent         | 2017-05-21
 cbea0789-6886-431a-a8e1-723d5aafc7b9 |                                      | standard-store | 10511          | scheduled    | 2019-02-12
 ec8988ff-5136-4b81-b448-cd456dc487a4 |                                      | standard-store | 10511          | review       | 2019-01-13
 0e119440-5fbc-4afe-a784-a6bcfe3a6e4d |                                      | standard-store | 10511          | draft        | 
 98503a20-4396-4809-b3ec-8e330c15afa9 |                                      | standard-store | 10511          | needs_action | 2018-12-11
 d33a9173-dc64-464f-8e58-49b4c9c2fdae |                                      | standard-store | 10511          | draft        | 
 bee0dc72-acca-44e2-82ea-d18e830f91a2 |                                      | standard-store | 10511          | sent         | 2016-03-12

So the output will be like

                  id                  |            collection_id             |      type      | affiliation_id |    status    | scheduled_for_date 
--------------------------------------+--------------------------------------+----------------+----------------+--------------+--------------------
 1143c066-01ed-4eb5-a146-68487de702a9 | bce85e31-4d2f-43b1-b263-57fca356856f | multi-store    | 12091          | draft        | 
 6e4dee09-7a4e-47be-bdd8-935a67bb2063 | 740f6b42-bbf1-4aeb-8fe9-6874635d9e29 | multi-store    | 12091          | draft        | 
 3bb38fbd-d411-4f78-9c42-c858bf57b784 |                                      | standard-store | 10511          | draft        | 
 fbb3b175-1a3b-4515-b0b3-0ce6d6d0145f |                                      | standard-store | 10511          | draft        | 
 84004999-d2cf-4af4-bfaa-c1077d1d8621 |                                      | standard-store | 10511          | sent         | 2017-05-21
 cbea0789-6886-431a-a8e1-723d5aafc7b9 |                                      | standard-store | 10511          | scheduled    | 2019-02-12
 ec8988ff-5136-4b81-b448-cd456dc487a4 |                                      | standard-store | 10511          | review       | 2019-01-13
 0e119440-5fbc-4afe-a784-a6bcfe3a6e4d |                                      | standard-store | 10511          | draft        | 
 98503a20-4396-4809-b3ec-8e330c15afa9 |                                      | standard-store | 10511          | needs_action | 2018-12-11
 d33a9173-dc64-464f-8e58-49b4c9c2fdae |                                      | standard-store | 10511          | draft        | 
 bee0dc72-acca-44e2-82ea-d18e830f91a2 |                                      | standard-store | 10511          | sent         | 2016-03-12

One way to do it is by union:

SELECT DISTINCT ON (collection_id)
    collection_id,
    id,
    type,
    affiliation_id,
    status,
    scheduled_for_date
from
    message
where
    type = 'multi-store'

union

SELECT
    collection_id,
    id,
    type,
    affiliation_id,
    status,
    scheduled_for_date
from
    message
where
    type = 'standard-store'

But I feel that's less efficient, another option could be case based grouping. But that includes complexity when need to add all the select fields to group.

What is the most efficient way to write the query?

2

There are 2 answers

0
Stefanov.sm On

You can use distinct on clause with a case expression.

select distinct on (case when type='multi-store' then collection_id else id end)
  id, collection_id, type,
  affiliation_id, status, scheduled_for_date 
from the_table;

DB-Fiddle demo

Or, assuming that collection_id is NULL for types other than 'multi-store', even simpler

select distinct on (coalesce(collection_id, id)) ...
0
Erwin Brandstetter On

The best solution depends on undisclosed cardinalities.

Assuming relatively few "multi-store" rows, and relatively few rows per distinct collection_id among those, your UNION query looks good - after a couple of tweaks:

(  -- parentheses required
SELECT DISTINCT ON (collection_id)
       id, collection_id, type, affiliation_id, status, scheduled_for_date
FROM   message
WHERE  type = 'multi-store'
ORDER  BY collection_id, affiliation_id  -- ! to pick row with min affiliation_id
)
   
UNION ALL  -- !!!
SELECT id, collection_id, type, affiliation_id, status, scheduled_for_date
FROM   message
WHERE  type = 'standard-store';

fiddle

Use UNION ALL. Just UNION tries to eliminate (non-existent) duplicates, adding pointless cost.

Typically, you want a deterministic, stable pick from each set with the same collection_id. Add a suitable ORDER BY clause accordingly. See:

In this case, a partial index would be of help:

CREATE INDEX ON message (collection_id, affiliation_id) WHERE type = 'multi-store';

The index is used for the small share of "multi-store" rows, the rest is read from a sequential scan.