Content replacement in S3 files when unique IDs match on both sides, using big data solutions

I am trying to explore a use case: we have huge data (50 billion records) in files, each file holds around 50 million records, and each record has a unique identifier. A record that is present in file 10 can also be present in file 100, but the latest state of that record is the one in file 100. The files sit in Amazon S3.

Now let's say around 1 billion of the 50 billion records need reprocessing. Once reprocessing is complete, we need to identify every file that contains any of these 1 billion records and replace the content of those files for these 1 billion unique IDs.

Challenges: right now, we don't have a mapping that tells us which file contains which unique IDs. And the whole file replacement needs to complete within one day, which means we need parallel execution.

We have already started a task to maintain the file-to-unique-ID mapping. We need to load this mapping while processing the 1 billion records, look each record up in it, and identify all the distinct file dates for which content replacement is required.

The mapping will be huge, because it has to hold 50 billion records, and it may grow further since this is a growing system.
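
To make the lookup step concrete, here is a rough PySpark (EMR) sketch of what I have in mind; the S3 paths and column names (unique_id, file_date) are just placeholders, not our real schema:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("find-files-to-rewrite").getOrCreate()

    # Mapping of unique ID -> file date, produced by the separate mapping task.
    mapping = spark.read.parquet("s3://my-bucket/id-to-file-mapping/")   # unique_id, file_date
    # The ~1B unique IDs that were reprocessed.
    reprocessed = spark.read.parquet("s3://my-bucket/reprocessed-ids/")  # unique_id

    # 1B IDs are far too large to broadcast, so this is a plain shuffle join.
    files_to_rewrite = (
        mapping.join(reprocessed, on="unique_id", how="inner")
               .select("file_date")
               .distinct()
    )

    files_to_rewrite.write.mode("overwrite").parquet("s3://my-bucket/files-to-rewrite/")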

Any thoughts around this?

1 Answer

Answered by John Rotenstein:

You will likely need to write a custom script that will ETL all your files.

Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)

If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).
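
For illustration only, here is a minimal PySpark sketch of that "one record per unique ID, keeping the latest" step, assuming such a timestamp column existed (called updated_at below; all paths and column names are assumptions):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("latest-per-id").getOrCreate()

    records = spark.read.parquet("s3://my-bucket/all-records/")

    # Rank records within each unique_id by timestamp, newest first, and keep rank 1.
    w = Window.partitionBy("unique_id").orderBy(F.col("updated_at").desc())

    latest = (
        records.withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") == 1)
               .drop("rn")
    )

    latest.write.mode("overwrite").parquet("s3://my-bucket/latest-records/")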

Rather than creating a system to look up unique IDs in files, you should have your system output a timestamp. This way, the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the file.

I would suggest:

  • Process each existing file (yes, I know you have a lot!) and add a column that represents the filename (a sketch of this step follows the list)
  • Once you have a new set of input files with the filename column (that acts to identify the latest record), use Amazon Athena to read all records and output one row per unique ID (with the latest date). This would be a normal SELECT... GROUP BY statement with a little playing around to get only the latest record.
  • Athena would output new files to Amazon S3, which will contain the data with unique records. These would then be the source records for any future processing you perform.
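
As a rough sketch of the first bullet, a Spark job on EMR could tag every row with the file it was read from using input_file_name(); the source format, paths, and column names below are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tag-with-filename").getOrCreate()

    # Read the existing files (Parquet is an assumption; adjust to the actual layout).
    records = spark.read.parquet("s3://my-bucket/raw-files/")

    # input_file_name() returns the full S3 path of the file each row was read from.
    tagged = records.withColumn("source_file", F.input_file_name())

    tagged.write.mode("overwrite").parquet("s3://my-bucket/tagged-files/")

Athena could then query the tagged output and keep, for each unique ID, only the row whose source_file sorts as the latest, which is the SELECT ... GROUP BY described above.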