We have an ETL service provider that loads data from our CRM into our Redshift instance every 3 hours. We would like insight into the changes that are made with every sync, or at least on a week-to-week basis.
My first instinct is to have a Lambda triggered daily that duplicates the entire table into an archive,
e.g. CREATE TABLE crm.leads_YYYY_MM_DD AS (SELECT * FROM crm.leads)
and then build some reports around the field-level diffs between those snapshots. I am wondering if there is a more intelligent approach. It seems like it would be difficult to reconstruct a history from these snapshots, and an enormous amount of unchanged data would be stored unnecessarily. What is a better strategy?
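For reference, the field-level diff I have in mind would look something like this between two snapshots (lead_id and email are hypothetical column names; NVL is used because <> alone won't match NULLs):

    -- Rows whose email changed between two daily snapshots
    SELECT cur.lead_id,
           prev.email AS old_email,
           cur.email  AS new_email
    FROM crm.leads_2024_01_16 cur
    JOIN crm.leads_2024_01_15 prev
      ON prev.lead_id = cur.lead_id
    WHERE NVL(cur.email, '') <> NVL(prev.email, '');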
The only way to avoid "an enormous amount of unchanged data" is to store the "old" data only when "new" data changes it. This complicates the ETL process, because you effectively need to capture the existing data before the process adds or updates it.
The process would generally be:

1. Load the incoming extract into a staging table rather than straight into the main table.
2. Copy the current version of every row the staging data will change into a history table (join on the primary key, keeping only rows where at least one tracked field differs), stamping each with the sync time.
3. Apply the staging data to the main table: update matched rows, insert new ones.
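A minimal sketch of those steps in Redshift SQL, assuming a primary key lead_id, a tracked field email, and a history table with the same columns plus an archived_at timestamp (all names are illustrative):

    -- 1. Stage the new extract (loaded by COPY or your ETL tool)
    CREATE TEMP TABLE leads_staging (LIKE crm.leads);
    -- ... load the extract into leads_staging here ...

    -- 2. Archive the current version of every row the sync will change
    INSERT INTO crm.leads_history
    SELECT l.*, GETDATE() AS archived_at
    FROM crm.leads l
    JOIN leads_staging s ON s.lead_id = l.lead_id
    WHERE NVL(l.email, '') <> NVL(s.email, '');  -- repeat per tracked field

    -- 3a. Update matched rows (optionally restrict to changed rows
    --     to avoid rewriting identical data)
    UPDATE crm.leads
    SET email = s.email  -- , other fields ...
    FROM leads_staging s
    WHERE leads.lead_id = s.lead_id;

    -- 3b. Insert rows that don't exist in the main table yet
    INSERT INTO crm.leads
    SELECT s.*
    FROM leads_staging s
    LEFT JOIN crm.leads l ON l.lead_id = s.lead_id
    WHERE l.lead_id IS NULL;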
This would result in the history table containing the "old" data for anything that has been changed. However, it wouldn't help with identifying "new" data; for that, you'd need to add a timestamp column to the main table recording when each row was added.
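For instance (column name illustrative; Redshift should populate the default for rows inserted after the column is added, but verify how it backfills existing rows):

    -- Record when each row first arrives
    ALTER TABLE crm.leads ADD COLUMN created_at TIMESTAMP DEFAULT GETDATE();

    -- New leads since the last 3-hourly sync
    SELECT *
    FROM crm.leads
    WHERE created_at > DATEADD(hour, -3, GETDATE());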
You should start by thinking about what output you'd like from this whole process. For example:

- a per-sync summary of how many rows were added or changed;
- a field-level change log ("lead 123's email changed from X to Y at time Z");
- the ability to reconstruct what a row looked like on any given date.
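As a sketch, the week-to-week report you mentioned could then be driven straight off the history table from the process above (names illustrative):

    -- Values archived (i.e. overwritten) in the past 7 days
    SELECT lead_id, archived_at, email AS email_before_change
    FROM crm.leads_history
    WHERE archived_at > DATEADD(day, -7, GETDATE())
    ORDER BY lead_id, archived_at;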