Returning old row values in UPSERT

263 views Asked by At

I have the following table in Postgres 14:

CREATE TABLE mytable (
  id            text PRIMARY KEY
, top           bigint  NOT NULL
, top_timestamp bigint  NOT NULL
);

I would like to upsert top and top_timestamp while returning the old values (if any). If the row already exists, I want to update it only if the new value of top is > old value.

The solution also HAS to account for concurrent writes.

For the upsert, I have this so far:

INSERT INTO mytable
AS mt (id, top, top_timestamp)
VALUES ('some-id', 123, 999)
ON CONFLICT (id)
DO UPDATE SET top=123, top_timestamp=999
WHERE mt.top < 123

I realise that RETURNING can only refer to new values, so to return the old values, I have this:

WITH old_values AS (
   SELECT top, top_timestamp FROM mytable
   WHERE id = 'some-id' FOR UPDATE
   )
, upd AS ($upsertSqlAbove)
SELECT top AS old_top, top_timestamp AS old_top_timestamp
FROM old_values;

This seems to work, but is it safe for concurrent writes? Does SELECT FOR UPDATE work as I expect here? Namely, it will lock the row until the whole query is finished?

What is the difference between that first query and this:

INSERT INTO mytable
AS mt (id, top, top_timestamp)
VALUES ('some-id', 123, 999)
ON CONFLICT (id)
DO UPDATE SET top=123, top_timestamp=999
WHERE mt.top < 123
RETURNING (SELECT top FROM mytable WHERE id = 123) AS last_top, 
          (SELECT top_timestamp FROM mytable WHERE id = 123) AS last_top_timestamp

Here is a fiddle to test.

1

There are 1 answers

5
Erwin Brandstetter On BEST ANSWER

What's the issue?

You ask:

What is the difference between that first query and this:

Either reports old values from the last row version that has just been updated in a concurrency-safe manner. But the first one is more expensive. It locks an existing row earlier. So it holds the lock for longer and possibly blocks concurrent writes to the same row for longer. It also executes one more statement. Most importantly, it locks the row even when no later UPDATE happens, which is a complete waste.

So the 2nd query is better. Except for two flaws:

  1. It runs a separate SELECT for every column. That would seem necessary since a subquery expression in the RETURNING clause can only return a single value.

  2. It runs those SELECT queries even after an INSERT. Then the SELECT is guaranteed to come up empty - at full cost.

Return the old row only if updated

Unfortunately, the UPDATE part of an UPSERT does not allow additional tables in a FROM clause - unlike regular UPDATE. See:

So your attempts are well-founded.

This query fixes both flaws (tested in Postgres 14 & 16):

INSERT INTO tbl AS t (id, top, top_timestamp)
VALUES ('some-id', 123, 999)
ON     CONFLICT (id) DO UPDATE
SET    top = EXCLUDED.top                      -- !
     , top_timestamp = EXCLUDED.top_timestamp  -- !
WHERE  t.top < EXCLUDED.top                    -- !
RETURNING (SELECT t_old FROM tbl t_old
           WHERE  t_old.id = t.id
           AND    t.xmax <> 0                  -- actually an UPDATE!
          ).*                                  -- !!!

fiddle

(To clarify for the general audience: The subquery in the RETURNING clause operates on the same snapshot of the table and does not see any row inserted or updated in the same statement, yet. It sees the old row version.)

The test t.xmax <> 0 filters actual updates. See:

And we can circumvent the discussed restriction for subqueries by returning a single (well-known) row type, and then de-composing it - with ( SELECT ... ).*.

Plus, use the special row variable EXCLUDED. It holds the row proposed for insert, and helps to avoid spelling out input values repeatedly. (Subtle difference: all default values and triggers have already been applied.)

This returns the whole row. To only return selected columns, what the UPSERT in a CTE and decompose later:

WITH ups AS (
   INSERT INTO tbl AS t (id, top, top_timestamp)
   VALUES ('some-id', 124, 1000)
   ON     CONFLICT (id) DO UPDATE
   SET    top = EXCLUDED.top
        , top_timestamp = EXCLUDED.top_timestamp
   WHERE  t.top < EXCLUDED.top
   RETURNING (SELECT t_old FROM tbl t_old
              WHERE  t_old.id = t.id
              AND    t.xmax <> 0
             ) AS t_old
   )
SELECT id, top , top_timestamp  -- just the columns you want
FROM  (SELECT (t_old).* FROM ups) sub;

Note how I decompose the row type in a subquery once. This is to avoid repeated evaluation. See:

Return the old row in any case

According to your comment, you even want the old row back if no update happens. If the WHERE condition is not true, no update happens and RETURNING does not return any rows. That's an even trickier combination of SELECT, INSERT and UPDATE (UPSERT). Above query won't cut it. Your CTE solution would work. Or an empty update like used here:

Simple, but wasteful - as explained in my answer to the same question:

Combine everything we have learned so far with this:

To arrive at this optimized function for single-row UPSERT:

CREATE OR REPLACE FUNCTION f_upsert_tbl(_id text, _top bigint, _top_timestamp bigint, OUT old_row tbl)
  LANGUAGE plpgsql AS
$func$
BEGIN
   LOOP
      SELECT *
      FROM   tbl
      WHERE  id = _id
      AND    top >= _top   -- new value isn't bigger
      INTO   old_row;

      EXIT WHEN FOUND;

      INSERT INTO tbl AS t
             ( id,  top,  top_timestamp)
      VALUES (_id, _top, _top_timestamp)
      ON     CONFLICT (id) DO UPDATE
      SET    top = EXCLUDED.top
           , top_timestamp = EXCLUDED.top_timestamp
      WHERE  t.top < EXCLUDED.top                    -- new value is bigger
      RETURNING (SELECT t_old FROM tbl t_old
                 WHERE  t_old.id = t.id
                 AND    t.xmax <> 0                  -- actually was an UPDATE!
                ).*
      INTO   old_row;

      EXIT WHEN FOUND;
   END LOOP;
END
$func$;

Call:

SELECT * FROM f_upsert_tbl('some-id', 123, 999);

Or:

SELECT id, top FROM f_upsert_tbl('new-id', 3, 3);

fiddle

OUT old_row tbl uses the registered row type of the table as "OUT" parameter. Replace with your actual (schema-qualified) table name. Obviously, adapt all column names and types to your actual table definition.

Should be the perfect solution. Keeps locks to a minimum (doesn't write-lock at all, when just returning an old, unchanged row), and handles every corner case concurrent writes might produce in default READ COMMITTED isolation level. The call becomes super simple and short as demonstrated.

Possible downsides:

  • Creates a dependency on the table type. If you modify table columns you may have to recreate the function. You can't drop the table unless you add CASCADE to the DROP command.
  • Not for multi-row UPSERT.