I am exploring a use case along these lines: we have a huge amount of data (50B records) spread across files, each file holds around 50M records, and each record has a unique identifier. A record that is present in file 10 can also be present in file 100, with the latest state of that record living in file 100. The files sit in AWS S3.
Now let's say around 1B of the 50B records need reprocessing. Once reprocessing is complete, we need to identify every file that has ever contained any of these 1B records and replace the content of those files for those 1B unique IDs.
Challenges: right now we don't have a mapping that tells us which file contains which unique IDs, and the whole file replacement needs to complete within one day, which means we need parallel execution.
We have already started a task to maintain a file-to-unique-ID mapping. While processing the 1B records, we would load this mapping, look up each ID in it, and identify all the distinct file dates for which content replacement is required (a rough sketch of that lookup is below).
The mapping itself will be huge, because it has to cover all 50B records, and it will keep growing since this is a growing system.
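For concreteness, the kind of lookup we have in mind is roughly the following. This is only a sketch: the S3 paths and the column names (`unique_id`, `file_date`) are placeholders, and we assume both the reprocessed-ID list and the mapping are stored as Parquet.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-affected-files").getOrCreate()

# Hypothetical inputs: the 1B reprocessed IDs and the file-to-ID mapping,
# both assumed to be Parquet datasets in S3 (paths are placeholders).
reprocessed_ids = spark.read.parquet("s3://my-bucket/reprocessed-ids/")  # column: unique_id
id_to_file_map = spark.read.parquet("s3://my-bucket/id-file-mapping/")   # columns: unique_id, file_date

# Join the reprocessed IDs against the mapping and collect the distinct
# file dates whose content needs to be replaced.
affected_files = (
    reprocessed_ids
    .join(id_to_file_map, on="unique_id", how="inner")
    .select("file_date")
    .distinct()
)

affected_files.write.parquet("s3://my-bucket/affected-file-dates/")
```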
Any thoughts around this?
You will likely need to write a custom script that will ETL all your files.
Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)
If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).
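For example, if each record carried a timestamp column, a Spark job on EMR could keep only the latest version of each ID with a window function. This is a minimal sketch under that assumption; the S3 paths and the column names (`unique_id`, `event_timestamp`) are placeholders.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("latest-record-per-id").getOrCreate()

# Read all of the files (input path is a placeholder).
records = spark.read.parquet("s3://my-bucket/records/")

# Rank each record's versions within its unique_id, newest timestamp first.
w = Window.partitionBy("unique_id").orderBy(F.col("event_timestamp").desc())

# Keep only the newest version of each record and drop the helper column.
latest = (
    records
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Write a new set of files containing only the latest state of each record.
latest.write.parquet("s3://my-bucket/records-latest/")
```

The same deduplication could be expressed in Athena as a single SQL query using `ROW_NUMBER()` over a partition by the unique ID, ordered by the timestamp.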
Rather than creating a system to look up unique IDs in files, you should have your system output a timestamp with each record. That way the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the files themselves.
I would suggest: