ETL pipeline in AWS with S3 as data lake: how to handle incremental updates


I have set up an ETL pipeline in AWS as follows:

input_rawdata -> S3 -> Lambda -> trigger Spark ETL script (via AWS Glue) -> output (S3, Parquet files)

My question is: assuming the above is the initial load of the data, how do I set things up to run incremental batches that arrive every day (or every hour) and add new rows or update existing records?

a.) How do I keep appending to the same S3 Parquet files, so that subsequent Presto queries return the latest data?

b.) How do I handle duplicate records, so that queries return only the record with the latest timestamp?

In the Spark script, do I need to create a Hive external table with S3 as the source and use it in Presto?

Thanks for any input.

There are 5 answers

Ashan On

First of all, don't try to append to the files that already exist in S3; instead, write new files, each containing multiple records.

To query S3 for analytics, you can first describe (catalog) the data using AWS Athena, whose data catalog is Hive-metastore compatible.

To remove duplicates, you can write a SQL-like query through Athena that returns only the unique (latest) record set, as sketched below.
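A minimal sketch of that dedup-by-latest-timestamp idea, written in PySpark since the pipeline already runs a Spark ETL script; the column names `record_id` and `updated_at` and the S3 paths are assumptions, not part of the original question.

```python
# Keep only the latest version of each record, assuming a business key
# column `record_id` and an event-time column `updated_at`.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-latest").getOrCreate()

df = spark.read.parquet("s3://my-bucket/etl-output/")  # hypothetical path

# Rank rows per key by timestamp, newest first, and keep rank 1.
w = Window.partitionBy("record_id").orderBy(F.col("updated_at").desc())

latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

latest.write.mode("overwrite").parquet("s3://my-bucket/etl-dedup/")  # hypothetical
```

The same shape works as an Athena/Presto query over the cataloged table: select rows where ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY updated_at DESC) equals 1.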

Oron Zimmer On

Since you are already using Lambda and Glue, you could use Kinesis and the KCL to capture your data as a stream, or use Spark Streaming, since you already have a Spark script. Either option gives you the incremental input you are looking for, and there is less risk of data corruption when streaming into an existing dataset in near real time (see the streaming sketch below).
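As a rough illustration of the Spark Streaming route, here is a Structured Streaming sketch that treats new files landing under an S3 prefix as a stream and appends Parquet output; the paths and schema are assumptions, and a Kinesis source could be swapped in with the appropriate connector.

```python
# Structured Streaming sketch: treat newly arriving S3 files as a stream
# and append them to a Parquet output location with checkpointing.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("incremental-ingest").getOrCreate()

# A streaming file source needs the schema declared up front (assumed columns).
schema = StructType([
    StructField("record_id", StringType()),
    StructField("payload", StringType()),
    StructField("updated_at", TimestampType()),
])

incoming = (
    spark.readStream
         .schema(schema)
         .json("s3://my-bucket/input_rawdata/")   # hypothetical input prefix
)

query = (
    incoming.writeStream
            .format("parquet")
            .option("path", "s3://my-bucket/etl-output/")                  # hypothetical
            .option("checkpointLocation", "s3://my-bucket/checkpoints/ingest/")
            .outputMode("append")
            .start()
)

query.awaitTermination()
```

The checkpoint location is what lets Spark remember which input files it has already processed across restarts, which is the incremental behavior being asked about.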

Then you can pass the data through your ETL process with Glue. You can schedule or chain your ETL jobs in Glue, and it can load the transformed data into an S3 bucket. Glue is batch-oriented, but the minimum schedule interval is 5 minutes, so with Kinesis performing the first steps and then passing the prepared data to Glue you can still get incremental updates (a small scheduling sketch follows). You can review this to get additional resources and ideas on ETL architecture.
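One way to set up such a schedule, sketched with boto3; the trigger name, cron expression, and Glue job name here are assumptions.

```python
# Create a scheduled Glue trigger that runs an existing ETL job periodically.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="incremental-etl-trigger",          # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0/15 * * * ? *)",         # every 15 minutes
    Actions=[{"JobName": "my-etl-job"}],     # hypothetical Glue job name
    StartOnCreation=True,
)
```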

For any duplicate data you can run a SQL-ish query on your completed datasets, if necessary.

Pierre On

Apache Hudi would be a great tool for that: https://hudi.incubator.apache.org/. It lets you upsert into tables stored on S3 in Parquet format, and Presto is compatible with it. For example, on EMR 5.28 Hudi is already installed and you can query Hudi datasets with Hive, Spark, and Presto. A minimal upsert sketch follows.
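A minimal upsert sketch using the Hudi Spark datasource, assuming the Hudi bundle is on the Spark classpath (as on EMR 5.28+); the table name, record key, precombine field, and S3 paths are assumptions.

```python
# Upsert a DataFrame of new/changed rows into a Hudi table on S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

updates = spark.read.json("s3://my-bucket/incremental/2020-01-01/")  # hypothetical

hudi_options = {
    "hoodie.table.name": "my_table",                           # hypothetical
    "hoodie.datasource.write.recordkey.field": "record_id",    # business key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest wins on conflict
    "hoodie.datasource.write.operation": "upsert",
}

(
    updates.write
           .format("org.apache.hudi")   # newer Hudi releases also accept "hudi"
           .options(**hudi_options)
           .mode("append")
           .save("s3://my-bucket/hudi/my_table/")  # hypothetical table path
)
```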

FraDel On

You can now use Delta Lake to perform upserts, appends, and other incremental changes on your data using Spark (see this). The tool writes data in the "delta" format (Parquet files plus transaction-log metadata files), and you can even roll back or query the data as of a point in time. Be careful: at the time of writing it wasn't fully working with Athena/Presto in the open-source version, since you needed to generate a manifest first (but a fix was on its way). A minimal merge sketch follows.
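A minimal upsert (MERGE) sketch with the Delta Lake Python API, assuming the delta-spark package is available; the table path, join key, and input path are assumptions. The manifest generated at the end is what Presto/Athena needed in order to read the table.

```python
# Upsert new/changed rows into a Delta table on S3 using MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

updates = spark.read.json("s3://my-bucket/incremental/2020-01-01/")   # hypothetical

target = DeltaTable.forPath(spark, "s3://my-bucket/delta/my_table/")  # hypothetical

(
    target.alias("t")
          .merge(updates.alias("u"), "t.record_id = u.record_id")  # assumed key
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)

# Regenerate the manifest so Presto/Athena can see the current table state.
target.generate("symlink_format_manifest")
```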

Ajit Nagar On

You can define a job bookmark on the Glue ETL job.

The bookmark keeps track of which S3 files have been processed, so once your historical load is done and you dump new files into S3, only the new files are processed by the ETL job, and Glue marks those files as processed internally.

You can handle incremental data this way; a minimal sketch is below.
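A minimal Glue job sketch that relies on bookmarks, assuming the job is created with the `--job-bookmark-option job-bookmark-enable` parameter; the catalog database, table name, and output path are placeholders. The `transformation_ctx` values are what let the bookmark track which input has already been processed.

```python
# AWS Glue ETL script sketch that picks up only files not processed in earlier runs.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is required for the bookmark to track processed input files.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",              # hypothetical catalog database
    table_name="input_rawdata",     # hypothetical catalog table
    transformation_ctx="source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/etl-output/"},  # hypothetical
    format="parquet",
    transformation_ctx="sink",
)

# Committing the job advances the bookmark past the files just processed.
job.commit()
```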