DynamoDB data load after transforming files. Any AWS service like GCP Dataflow/Apache Beam?


New to AWS. I have a requirement to create a daily batch pipeline:

  1. Read 6-10 CSV files of 1 GB+ each. (Each file is an extract of a table from a SQL database.)
  2. Transform each file with some logic and join all files to create one item per id.
  3. Load the joined data into a single DynamoDB table with upsert logic.

The current approach I have started with: we have an EC2 instance available for such tasks, so I am writing a Python script to (1) read all the CSVs, (2) convert them into a denormalised JSON structure, and (3) import it into DynamoDB using boto3.
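Simplified, the script looks roughly like the sketch below (the file paths, table name, and "id" join key are just placeholders, and it assumes pandas is available on the instance):

    # Minimal sketch of the single-script approach; paths, table name, and the
    # "id" join key are placeholders for the real extracts.
    import glob
    from decimal import Decimal

    import boto3
    import pandas as pd

    # 1. Read all extracts and join them on the shared id column.
    frames = [pd.read_csv(path) for path in glob.glob("extracts/*.csv")]
    joined = frames[0]
    for frame in frames[1:]:
        joined = joined.merge(frame, on="id", how="outer")

    # 2. Write one item per id. put_item replaces the whole item, which acts as
    #    an upsert when each daily run produces the complete item; a partial
    #    update would need update_item instead.
    table = boto3.resource("dynamodb").Table("my-joined-table")
    with table.batch_writer(overwrite_by_pkeys=["id"]) as batch:
        for record in joined.to_dict(orient="records"):
            # DynamoDB does not accept Python floats, so convert them to Decimal
            # and drop missing values.
            item = {k: (Decimal(str(v)) if isinstance(v, float) else v)
                    for k, v in record.items() if pd.notna(v)}
            batch.put_item(Item=item)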

My concern is whether my data counts as "Big Data". Is processing 10 GB of data with a single Python script OK? And if the file sizes grow 10x down the line, will I face scaling issues? I have only worked with GCP in the past, and in this scenario I would have used Dataflow to get the task done. Is there an equivalent in AWS terms? Would be great if someone can provide some thoughts. Thanks for your time.


There are 2 answers

Pablo (accepted answer)

A more appropriate equivalent of Dataflow in AWS is Kinesis Data Analytics, which supports Apache Beam's Java SDK.

You can see an example of an Apache Beam pipeline running on their service.

Apache Beam is able to write to DynamoDB.
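For illustration only, here is a rough sketch of that idea using the Beam Python SDK, with a plain boto3 ParDo doing the DynamoDB writes (note that Kinesis Data Analytics itself expects the Java SDK, which ships a DynamoDB connector). The table name, file pattern, and join key below are placeholder assumptions:

    # Rough Beam (Python SDK) sketch; the table name, file pattern, join key,
    # and CSV parsing are placeholders, not a definitive implementation.
    import apache_beam as beam
    import boto3


    class WriteToDynamoDB(beam.DoFn):
        """Writes each joined record to DynamoDB with put_item (upsert-by-replace)."""

        def setup(self):
            self.table = boto3.resource("dynamodb").Table("my-joined-table")

        def process(self, element):
            self.table.put_item(Item=element)


    def parse_csv_line(line):
        # Placeholder parsing; real code would handle each extract's schema.
        record_id, value = line.split(",", 1)
        return record_id, {"id": record_id, "value": value}


    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadCSVs" >> beam.io.ReadFromText("extracts/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_csv_line)
            | "GroupById" >> beam.GroupByKey()
            | "MergeRecords" >> beam.MapTuple(
                lambda record_id, records: {k: v for r in records for k, v in r.items()}
            )
            | "Upsert" >> beam.ParDo(WriteToDynamoDB())
        )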

Good luck!

Steven Ensslen

The AWS equivalent to Google Cloud Dataflow is AWS Glue. The documentation isn't clear, but Glue does write to DynamoDB.
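As a rough sketch (the S3 paths, join key, and table name here are only placeholders), a Glue PySpark job could join the extracts and write them to DynamoDB like this:

    # Rough Glue PySpark sketch; S3 paths, join key, and table name are
    # placeholder assumptions.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Read the CSV extracts (here from S3) and join them on the shared id.
    left = spark.read.option("header", "true").csv("s3://my-bucket/extract_a/")
    right = spark.read.option("header", "true").csv("s3://my-bucket/extract_b/")
    joined = left.join(right, on="id", how="outer")

    # Write to DynamoDB; items with the same key are overwritten, which gives
    # upsert-by-replacement semantics.
    glue_context.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(joined, glue_context, "joined"),
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "my-joined-table",
            "dynamodb.throughput.write.percent": "1.0",
        },
    )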