I have created and automated a custom workflow for eagerly synchronizing a series of materialized views. After trying several different approaches (for one to many relationships), I have found the most reliable synchronization workflow is to delete all records that may have been impacted and then insert the new records.
DELETE FROM
some_materialized_view
WHERE
set_of_records_key = some_value;
INSERT INTO
some_materialized_view
SELECT
*
FROM
some_query_generating_some_materialized_view;
Note: some_query_generating_some_materialized_view is a complex read operation that takes a non-trivial amount of resources to execute. Additionally, some_materialized_view is heavily indexed with several foreign keys and other constraints.
This feels extremely heavy handed. This workflow comes with excessive delete and insert operations that are often times needless as some of the deleted records may have been identical, or similar enough to be a candidate for an UPDATE.
I would prefer something like the following:
DELETE FROM
some_materialized_view
USING
(
SELECT
unique_key
FROM
some_materialized_view
WHERE
set_of_records_key = some_value
EXCEPT
INSERT INTO
some_materialized_view
SELECT
*
FROM
some_query_generating_some_materialized_view
ON CONFLICT (...) DO UPDATE
SET
foo = EXCLUDED.foo,
bar = EXCLUDED.bar,
...
WHERE
some_materialized_view <> EXCLUDED
RETURNING
unique_key
) AS sub_query
WHERE
some_materialized_view.unique_key = sub_query.unique_key;
The problem is in the ON CONFLICT ... DO UPDATE ... WHERE ... RETURNING
clause.
as addressed in this question: How to use RETURNING with ON CONFLICT in PostgreSQL?
the RETURNING clause only returns impacted records. So records not impacted are not returned, and thus (in the example above) deleted inappropriately.
It seems the only way to get RETURNING
to actually return all records is unnecessarily update identical records by removing the WHERE some_materialized_view <> EXCLUDED
clause, or run some_query_generating_some_materialized_view again in another EXCEPT
clause... both options are also not ideal.
So, what am I missing? Are there other options available? If not, in general, is it preferred to perform a complex, resource intensive, read operation over a needless UPDATE (remembering the associated index maintenance and check constraints)?
Note: I'm not including EXPLAIN ANALYZE
results as this is not specific to a single query, rather a question in general. For the sake of maintainability and sanity this project needs to be consistent and this technique is used several times with tables of different structures and use cases (some read heavy, others write heavy).