I've been trying to find online documentation or blogs on approaches to validating end-to-end CDC capture completeness, aka "Data Reconciliation". At my company we use Debezium for both PG and Mongo to capture change streams and replicate them to our Snowflake DWH via Kafka. Are there specific techniques to verify that the WAL or oplog maps 100% to captured events? Maybe by exposing primitives to count / checksum WAL / oplog operations as metrics / metadata fields that can be compared against change event counts? While there are a couple of offerings that purport to help with this (e.g. BryteFlow, Redgate), I'm curious to learn whether there are custom or open source approaches to this problem, and whether there are any online resources I may have missed.
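To make concrete the kind of check I'm imagining, here's a minimal sketch of the state-based variant: count plus an order-independent checksum over a bounded window, computed on both sides and compared. Everything in it is a placeholder I made up for illustration (the `orders` table, `id` / `updated_at` columns, connection details, the 15-minute "replication should have caught up by now" allowance), not production code, and the hex-to-integer idioms may need adjusting per engine.

```python
"""
Sketch: count + checksum reconciliation probe between a Postgres source
table and its Snowflake replica. All identifiers and connection details
are hypothetical placeholders.
"""
from datetime import datetime, timedelta, timezone

import psycopg2                  # pip install psycopg2-binary
import snowflake.connector       # pip install snowflake-connector-python

# Reconcile a closed window that replication should already have caught up
# on (here: the hour ending 15 minutes ago) to avoid counting in-flight events.
WINDOW_END = datetime.now(timezone.utc) - timedelta(minutes=15)
WINDOW_START = WINDOW_END - timedelta(hours=1)

# Order-independent checksum: sum the first 7 hex chars of md5(id) as an
# integer. The hex-to-int expression differs between the two engines.
PG_SQL = """
    SELECT count(*),
           coalesce(sum(('x' || substr(md5(id::text), 1, 7))::bit(28)::int), 0)
    FROM orders
    WHERE updated_at >= %s AND updated_at < %s
"""

SF_SQL = """
    SELECT count(*),
           coalesce(sum(to_number(upper(substr(md5(id::varchar), 1, 7)), 'XXXXXXX')), 0)
    FROM orders
    WHERE updated_at >= %s AND updated_at < %s
"""


def probe_postgres():
    with psycopg2.connect("dbname=app user=recon host=pg.internal") as conn:
        with conn.cursor() as cur:
            cur.execute(PG_SQL, (WINDOW_START, WINDOW_END))
            return cur.fetchone()


def probe_snowflake():
    conn = snowflake.connector.connect(
        account="my_account", user="recon", password="...",
        warehouse="RECON_WH", database="DWH", schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(SF_SQL, (WINDOW_START, WINDOW_END))
        return cur.fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    src_count, src_sum = probe_postgres()
    dst_count, dst_sum = probe_snowflake()
    if (int(src_count), int(src_sum)) != (int(dst_count), int(dst_sum)):
        # In practice this would be emitted as a metric / alert.
        print(f"MISMATCH: source=({src_count}, {src_sum}) "
              f"target=({dst_count}, {dst_sum}) "
              f"window=[{WINDOW_START}, {WINDOW_END})")
    else:
        print("window reconciles")
```

What I'd really like, though, is the event-level version of the same idea: counts/checksums of WAL or oplog operations per window compared against Debezium event counts per topic. Counters like n_tup_ins / n_tup_upd / n_tup_del in pg_stat_user_tables get close on the source side, but they aren't windowed or tied to LSNs, which is exactly the gap I'm asking about.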
As an aside, I'm quite surprised that this isn't discussed more in blogs and on the web, given how crucial it is to having confidence in replication streams. I've had only limited success, finding just the following resources:
- https://sirupsen.com/napkin/problem-14-using-checksums-to-verify
- https://www.guru99.com/what-is-data-reconciliation.html
- https://blog.metamirror.io/cdc-drift-and-reconciliation-6cc524aa8c28
- https://shopify.engineering/capturing-every-change-shopify-sharded-monolith
- https://aws.amazon.com/blogs/big-data/build-a-distributed-big-data-reconciliation-engine-using-amazon-emr-and-amazon-athena/