Implementing CDC in Amazon S3


I am fairly new to the cloud space. As part of our current project, we are trying to create a data lake in Amazon S3 buckets. There will be another S3 layer that will contain the CDC (change data capture) from the previous layer. Talend or StreamSets is what the architecture team is proposing to use. Is there any other way CDC can be implemented from one S3 bucket to another?


There are 2 answers

Nitesh Saxena (Best Answer)

Implementing or patching CDC is always an important task when pulling data from transactional sources. Because objects in S3 are immutable, S3 does not provide anything of its own to merge the change data captured (CDC). There are a few ways CDC patching can be achieved in S3 or AWS data lakes.

First, you need to make sure that your ETL tool's pipeline (StreamSets/NiFi/Sqoop) can fetch the updated transactions/records from the source system (either by using a last_modified_date column or by reading transaction logs) and place them in a separate S3 path or a different S3 bucket (the CDC delta). A minimal sketch of such an incremental extract is shown below.
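For illustration, here is a rough PySpark sketch of that incremental pull. The source database, the `orders` table, the `last_modified_date` column, and the S3 paths are all hypothetical; the same idea applies whichever ETL tool you end up using.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-extract").getOrCreate()

# Watermark from the previous successful run (hypothetical value; in practice
# you would read it from a control table, a DynamoDB item, or a parameter store).
last_watermark = "2020-01-01 00:00:00"

# Pull only the rows changed since the watermark, using the last_modified_date column.
changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")  # hypothetical source DB
    .option("dbtable",
            f"(SELECT * FROM orders "
            f"WHERE last_modified_date > '{last_watermark}') AS cdc_src")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the delta in its own S3 prefix (the CDC-delta layer), separate from the base data.
changed.write.mode("append").parquet("s3://my-data-lake/cdc-delta/orders/")
```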

Now, to merge this delta (CDC) into the base table, you can use any of the approaches mentioned below:

  1. If you are using AWS EMR or Spark in your environment, I would recommend Apache Hudi (a write sketch follows this list). It is open source now, but it was originally designed at Uber to provide transactional tables in data lakes. It can merge the CDC patch into the base data even in real-time scenarios, which may later save you the effort of implementing a lambda architecture in your data lake. Refer to this link - https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/
  2. Recently, Databricks launched an amazing feature: Delta Lake. Using Delta Lake is a great approach and gives you out-of-the-box performance (a MERGE sketch follows this list). Delta Lake brings ACID transactions to your data lake and performs well in both streaming and batch scenarios. Please refer to these links, where Delta Lake has been implemented with AWS DMS and S3: https://databricks.com/blog/2019/07/15/migrating-transactional-data-to-a-delta-lake-using-aws-dms.html https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
  3. Another way is to write your own custom Spark jobs to do this, as explained in the link below (a minimal join-based sketch also follows this list), but this is a slow and costly operation if your dataset is large, and you might need some other technique for real-time CDC patching. Refer to the link - change data capture in spark
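For option 1, an upsert of the CDC delta into a Hudi table from EMR/Spark might look roughly like this. The table name, record key, and S3 paths are made up for illustration, and the exact `hoodie.*` options can vary by Hudi version, so treat it as an outline rather than a drop-in job.

```python
# Read the CDC delta landed by the extract step.
cdc = spark.read.parquet("s3://my-data-lake/cdc-delta/orders/")

hudi_options = {
    "hoodie.table.name": "orders",                                     # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "order_id",             # primary key column
    "hoodie.datasource.write.precombine.field": "last_modified_date",  # latest record wins
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the delta into the Hudi-managed base table on S3.
(cdc.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-data-lake/base/orders_hudi/"))
```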
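For option 2, the Delta Lake equivalent is a MERGE into the base table, following the upsert pattern described in the Databricks links above. The S3 paths and the `order_id` join key are assumptions for the sketch.

```python
from delta.tables import DeltaTable

base = DeltaTable.forPath(spark, "s3://my-data-lake/base/orders_delta/")
updates = spark.read.parquet("s3://my-data-lake/cdc-delta/orders/")

# Upsert: update rows whose key already exists in the base table, insert the rest.
(base.alias("t")
     .merge(updates.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
```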
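And for option 3, a hand-rolled Spark merge is essentially "drop the old versions of the changed keys, then append the new ones". A minimal sketch (paths and key column are hypothetical; the result has to go to a staging location first, since Spark cannot overwrite a path it is still reading from):

```python
base = spark.read.parquet("s3://my-data-lake/base/orders/")
delta = spark.read.parquet("s3://my-data-lake/cdc-delta/orders/")

# Keep only base rows whose key is NOT in the delta, then append the new/updated rows.
merged = (
    base.join(delta.select("order_id"), on="order_id", how="left_anti")
        .unionByName(delta)
)

# Write to a staging prefix, then swap it in for the base path (e.g. via S3 copy/rename).
merged.write.mode("overwrite").parquet("s3://my-data-lake/base_staging/orders/")
```

This full rewrite of the base data is what makes the custom approach slow and costly on large datasets, which is why the Hudi and Delta Lake options above are usually preferable.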
SwapSays

You have to use an ETL/ELT tool to capture CDC. As far as I know, there is no way S3 can handle that on its own.

However, you could also consider AWS Glue or Matillion, as they run natively on AWS and compatibility might therefore be better than with Talend (P.S. I haven't used Talend).