I am fairly new to the cloud space. As part of our current project, we are trying to create a data lake in Amazon S3 buckets. There would be another S3 layer containing the changes (CDC) that happened in the previous layer. The architecture team is proposing to use Talend or StreamSets. Is there any other way to implement CDC from one S3 bucket to another?
Implementing or patching CDC is always an important task when pulling data from transactional sources. Because objects in S3 are immutable, S3 doesn't provide anything of its own to merge the captured change data (CDC). There are a few ways to achieve CDC patching in S3 or AWS data lakes.
First, make sure that your ETL tool's pipeline (StreamSets/NiFi/Sqoop) can fetch the updated transactions/records from the source system (either by using a column such as last_modified_date or by reading the transaction logs) and place them either under a different path in the same S3 bucket or in a separate S3 bucket (CDC-delta).
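To make the last_modified_date option concrete, here is a minimal sketch of a watermark-based extract in plain Python. The function name `extract_delta`, the row layout, and the sample values are all made up for illustration; a real pipeline would read from the source system and write the result to the CDC-delta location, which is not shown here.

```python
from datetime import datetime, timezone


def extract_delta(rows, watermark, ts_column="last_modified_date"):
    """Return the rows modified after `watermark`, plus the new watermark.

    `rows` is any iterable of dicts; `ts_column` is an assumed column name.
    """
    delta = [row for row in rows if row[ts_column] > watermark]
    # If nothing changed, keep the old watermark so the next run starts there.
    new_watermark = max((row[ts_column] for row in delta), default=watermark)
    return delta, new_watermark


# Hypothetical source rows standing in for a transactional table.
source_rows = [
    {"id": 1, "name": "alice", "last_modified_date": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "name": "bob", "last_modified_date": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "name": "carol", "last_modified_date": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]

# Timestamp recorded at the end of the previous (hypothetical) run.
last_run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

delta, new_watermark = extract_delta(source_rows, last_run)
```

Only the rows changed since the previous run end up in `delta`, and `new_watermark` is what you would persist for the next run. Note that a timestamp column alone will not reveal hard deletes at the source; for those you would need the transaction-log route.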
Now, to merge this delta (CDC) into the base table, you can use one of the approaches mentioned below: