I have set up an ETL pipeline in AWS as follows:
input_rawdata -> S3 -> Lambda -> trigger Spark ETL script (via AWS Glue) -> output (S3, Parquet files)
My question: assuming the above is the initial load of the data, how do I set up incremental batches that arrive every day (or every hour) and either add new rows or update existing records?
a) How do I keep appending to the same S3 Parquet files, so that subsequent Presto queries return the latest data?
b) How do I handle duplicate records, so that queries return only the row with the latest timestamp?
In the Spark script, do I need to create a Hive external table pointing at the S3 output and then use it from Presto?
Thanks for any input.
First of all, don't try to append to files that already exist in S3 (S3 objects can't be modified in place); instead, write each incremental batch as a new set of files, each containing multiple records.
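As a minimal sketch of that idea (assuming a PySpark/Glue script; the bucket names, paths, and columns below are hypothetical), each batch is written in append mode so new Parquet files land next to the existing ones, partitioned by a load date:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # Hypothetical input path for one incremental batch
    batch_df = (spark.read.json("s3://my-bucket/input_rawdata/2018-01-01/")
                     .withColumn("load_date", lit("2018-01-01")))

    # mode("append") adds new Parquet files alongside the existing ones instead
    # of touching them; partitionBy keeps each day's batch under its own prefix
    (batch_df.write
             .mode("append")
             .partitionBy("load_date")
             .parquet("s3://my-bucket/output/"))

Presto/Athena will see the new files on the next query, once the new partition has been registered in the metastore (e.g. via MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION).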
To query S3 for analytics, you can first describe the data using AWS Athena, whose data catalog is Hive-metastore compatible.
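For example, the external table definition could be submitted through the Athena API (database, table, schema, and S3 locations below are placeholders to adapt to your data):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Hypothetical schema over the Parquet output written by the Glue job
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (
        id         string,
        payload    string,
        updated_at timestamp
    )
    PARTITIONED BY (load_date string)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/output/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

Since the catalog is Hive-compatible, the same table definition can also be used from Presto or from Spark itself.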
To remove duplicates, you can write a SQL-like query through Athena that returns only the latest version of each record.
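One common way to do this (again with hypothetical table and column names) is a window function that keeps only the most recent row per key, ordered by the update timestamp:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Keep only the newest row per id, based on updated_at
    dedup_sql = """
    SELECT id, payload, updated_at
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM mydb.events
    ) t
    WHERE rn = 1
    """

    athena.start_query_execution(
        QueryString=dedup_sql,
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

Because Athena is Presto-based, the same query can be run directly against Presto as well, so your existing Presto queries only need this dedup logic (or a view built on it) to always return the latest data.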