How should one track changes to a dynamic table in Redshift?


We have an ETL service provider that loads data into our Redshift instance from our CRM every 3 hours. We would like insight into the changes made with each sync, or at least on a week-to-week basis.

My first instinct is to have a Lambda function triggered daily that duplicates the entire table into an archive,

e.g. CREATE TABLE crm.leads_YYYY_MM_DD AS SELECT * FROM crm.leads;

and then building some reports around the field-level diffs between the snapshots. I am wondering if there is a more intelligent approach. It seems like it would be difficult to reconstruct a history from this, and an enormous amount of unchanged data would be stored unnecessarily. What is a better strategy?
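For reference, a rough sketch of what I have in mind, with the snapshot names, the id key, and the status column all standing in as placeholders for the real schema:

    -- Daily snapshot (the date-suffixed name is a placeholder)
    CREATE TABLE crm.leads_2024_01_15 AS
    SELECT * FROM crm.leads;

    -- Field-level diff between two snapshots; every compared column
    -- would have to be listed out explicitly
    SELECT COALESCE(curr.id, prev.id) AS id,
           prev.status               AS old_status,
           curr.status               AS new_status
    FROM crm.leads_2024_01_16 curr
    FULL OUTER JOIN crm.leads_2024_01_15 prev
      ON curr.id = prev.id
    WHERE prev.id IS NULL               -- added since the last snapshot
       OR curr.id IS NULL               -- deleted since the last snapshot
       OR curr.status <> prev.status;   -- changed (NULLs need extra handling)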


1 Answer

Answered by John Rotenstein

The only way to avoid "an enormous amount of unchanged data" is to store the 'old' data only when the 'new' data changes it. This complicates the ETL process, because the existing data effectively has to be captured before the ETL process adds or updates it.

The process would generally be:

  • Load a staging table with the 'new' data from your service provider
  • For each unique identifier in the 'new' data, copy 'old' data to a history table, together with a timestamp
  • Upsert the 'new' data into the main data table

This would result in the history table containing the 'old' data for anything that has been changed. However, it wouldn't help with identifying 'new' data. For that, you'd need to add a timestamp column to the main table that records when each row was added.
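A rough sketch of that sequence in Redshift SQL, assuming an id key and treating the staging table, history table, S3 path, and IAM role as placeholders:

    BEGIN;

    -- 1. Load a staging table with the 'new' data
    CREATE TEMP TABLE leads_staging (LIKE crm.leads);
    COPY leads_staging
    FROM 's3://my-bucket/crm/leads/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV;

    -- 2. Copy the 'old' rows that are about to be replaced into a history
    --    table, together with a timestamp; add a filter on the compared
    --    columns if you only want to archive rows that actually changed
    INSERT INTO crm.leads_history
    SELECT l.*, GETDATE()
    FROM crm.leads l
    JOIN leads_staging s ON s.id = l.id;

    -- 3. Upsert the 'new' data: delete the matching rows, then insert
    --    everything from staging (an added_at column on the main table
    --    would need an explicit column list here)
    DELETE FROM crm.leads
    USING leads_staging
    WHERE crm.leads.id = leads_staging.id;

    INSERT INTO crm.leads
    SELECT * FROM leads_staging;

    END;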

You should start by thinking about what output you'd like from this whole process. For example:

  • Think about what you would like to see for data that was added (new values)
  • Think about what you would like to see for data that changed (old values + new values)
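Once that's decided, a weekly report could be driven entirely by the main table and the history table; the added_at and archived_at column names below follow the sketch above and are assumptions:

    -- Rows added in the last 7 days (assumes an added_at column on the main table)
    SELECT *
    FROM crm.leads
    WHERE added_at >= DATEADD(day, -7, GETDATE());

    -- Old vs. new values for rows changed in the last 7 days
    SELECT h.id,
           h.status      AS old_status,
           l.status      AS new_status,
           h.archived_at AS changed_at
    FROM crm.leads_history h
    JOIN crm.leads l ON l.id = h.id
    WHERE h.archived_at >= DATEADD(day, -7, GETDATE());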