Delete rows based on age, status and existence of related rows with recursion in PostgreSQL

98 views Asked by At
create table the_table (
  id integer, root_id integer, parent_id integer, status text, ts timestamp, comment text);
insert into the_table values
(1, null, null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, standalone'),
(2, null, null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, root of 3,4'),
(3, 2,    null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, child of 2, parent of 4'),
(4, 2,    3,    'OPEN',     now()-'92d'::interval, '>90 days old, open, child of 2,3'),
(5, null, null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, root of 6,7'),
(6, 5,    null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, child of 5, parent of 4'),
(7, 5,    6,    'COMPLETE', now()-'10d'::interval, '<=90 days old, complete, child of 5,6' )
(8, null, null, 'COMPLETE', now()-'10d'::interval, '<=90 days old, complete, standalone'),
(9, null, null, 'COMPLETE', now()-'10d'::interval, '<=90 days old, complete, root of 10'),
(10,9,    null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, child of 9' ),
(11,11,   null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, parent/child of self'),
(12,null, 12,   'COMPLETE', now()-'92d'::interval, '>90 days old, complete, parent/child of self'),
(13,14,   null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, cross-parent/child of 14'),
(14,13,   null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, cross-parent/child of 13'),
(15,null, null, 'COMPLETE', now()-'92d'::interval, '>90 days old, complete, parent of 16,17'),
(16,null, 15,   'COMPLETE', now()-'92d'::interval, '>90 days old, complete, child of 15'),
(17,null, 15,   'OPEN',     now()-'10d'::interval, '<=90 days old, open, child of 15');

I want to delete all records which are

  • older than 90 days
  • AND in COMPLETE status.
  • AND not related to rows that aren't older than 90 days and COMPLETE, directly or indirectly

Examples:

  • the row with id=1 should be deleted.

  • id=2 and id=3 are COMPLETE and older than 90 days but since id=4 is OPEN and has root_id=2 and parent_id=3, these three rows (id in (2,3,4)) should not be deleted.

  • id=5,id=6 and id=7 are all COMPLETE but id=7 is not older than 90 days and has root_id=5 and parent_id=6, so these three rows (id=5,id=6 and id=7) should not be deleted.

  • id=15 and id=16 are both old enough and COMPLETE but id=17 is still OPEN and younger than 90 days. 16 is not directly related to 17 but they share 15 as parent so these three rows should not be deleted.

I tried as per this answer, but there is one case failing: a removable parent with two children, one non-removable, one removable. (Last test case here fiddle)

While deleting, I would like to keep the entire group until all its members become removable. I tried using recursive CTE, but still I am not able to achieve it. Below is the query I tried:

    WITH recursive RecursiveHierarchy AS (
      SELECT id, root_id, parent_id, status, ts, comment
      FROM the_table
      WHERE status = 'OPEN'
      UNION ALL
      SELECT t.id, t.root_id, t.parent_id, t.status, t.ts, t.comment
      FROM the_table t
      JOIN RecursiveHierarchy r ON (  ( 'OPEN'=t.status
                             OR now()-t.ts <= '90d'::interval   ) 
                        AND (   t.id IN(r.root_id,r.parent_id)
                             OR r.id IN(t.root_id,t.parent_id) ) ) 
    )
    DELETE FROM the_table
    WHERE ts < NOW() - '90 days'::interval
    AND status = 'COMPLETE'
    AND id NOT IN (SELECT id FROM RecursiveHierarchy)
    RETURNING *;

I tried using FUNCTION as well

CREATE OR REPLACE FUNCTION delete_old_records()
RETURNS VOID AS $$
DECLARE
  record_to_delete the_table;
BEGIN
  FOR record_to_delete IN SELECT * FROM the_table
  WHERE ts < NOW() - '90 days'::interval AND status = 'COMPLETE'
  LOOP
    IF NOT EXISTS (
      SELECT 1 FROM the_table
      WHERE id = record_to_delete.id
        OR root_id = record_to_delete.id
        OR parent_id = record_to_delete.id
    ) THEN
      DELETE FROM the_table WHERE id = record_to_delete.id;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
1

There are 1 answers

2
Zegarek On
  1. You forgot the 90-day age in your base CTE query
  2. You're collecting all groups of rows with at least one non-removable by finding all of those, and then looking through their root/parent/child relations, to arrive at complete list of rows to protect. Since you're checking all the criteria in the CTE already, you don't need anything in the final DELETE except a WHERE id NOT IN() against the list of protected non-removables.
  3. Your structure needs to be guarded against cycles/loops at schema level, or you need to guard the CTE against falling into one by employing cycle detection.
WITH recursive RecursiveHierarchy AS (--base query finds all non-removables
  SELECT id, root_id, parent_id
  FROM the_table
  WHERE status = 'OPEN' OR now()-ts <= '90d'::interval
  UNION ALL             --recursion looks for all related to non-removables
  SELECT t.id, t.root_id, t.parent_id
  FROM the_table t
  JOIN RecursiveHierarchy r ON r.id in(t.parent_id,t.root_id)
                            OR t.id in(r.parent_id,r.root_id) 
) CYCLE id SET is_cycle USING path
DELETE FROM the_table
WHERE id NOT IN (SELECT id FROM RecursiveHierarchy
                 WHERE NOT is_cycle)
RETURNING *;
id root_id parent_id status ts comment
1 null null COMPLETE 2023-08-13 11:26:19.770144 >90 days old, complete, standalone
11 11 null COMPLETE 2023-08-13 11:26:19.785285 >90 days old, complete, parent/child of self
12 null 12 COMPLETE 2023-08-13 11:26:19.785285 >90 days old, complete, parent/child of self
13 14 null COMPLETE 2023-08-13 11:26:19.785285 >90 days old, complete, cross-parent/child of 14
14 13 null COMPLETE 2023-08-13 11:26:19.785285 >90 days old, complete, cross-parent/child of 13

Demo at db<>fiddle

You can also take the opposite route: find all immediate removables and search their relations to make sure they aren't linked to something non-removable: demo2

WITH recursive RecursiveHierarchy AS (
  SELECT id as candidate_id, id, root_id, parent_id, true AS is_removable
  FROM the_table
  WHERE status = 'COMPLETE' AND now()-ts > '90d'::interval
  UNION ALL
  SELECT r.candidate_id, t.id, t.root_id, t.parent_id, 
         t.status = 'COMPLETE' AND now()-t.ts > '90d'::interval
  FROM the_table t
  JOIN RecursiveHierarchy r ON (   r.id in(t.parent_id,t.root_id)
                                OR t.id in(r.parent_id,r.root_id) ) 
                            AND t.id NOT IN (r.candidate_id,r.id)
  WHERE NOT is_cycle
) CYCLE id SET is_cycle USING path
,removables AS (
  SELECT candidate_id AS id FROM RecursiveHierarchy 
  GROUP BY candidate_id HAVING bool_and(is_removable)
)
DELETE FROM the_table
WHERE id IN (SELECT id FROM removables)
RETURNING *;

Note that whatever approach you take, this is doomed to be fairly complex because of the recursive nature of the problem. In the worst case, you could have 15k records, each one related to the next, all of which are removable, and only the last one turns out to be non-removable. The entire set would still have to be checked one-by-one, which takes time and memory in this structure. What you can do is change the structure and introduce a group_id/family_id: mark all rows in a family with a common id and make a trigger maintain it. Then, searching for non-removable relation becomes trivial.

If you really have to do complex graph traversal on db, consider pgrouting (example) or apache-age.