Limiting EC2 resources used by AWS Data Pipeline during DynamoDB table backups


I need to back up 6 DynamoDB tables every couple of hours. I created 6 pipelines from templates and they ran fine, except that they spun up 6 or more EC2 virtual machines that mostly stayed up between runs. That is not an economy I can afford.

Does anyone have experience optimizing this kind of scenario?


There are 2 answers

Rohit Kulshreshtha

Some solutions that come to mind are:

One: To ensure that EC2 resources are terminated, set the terminateAfter property on the Ec2Resource definition. The semantics of terminateAfter are discussed here: How does AWS Data Pipeline run an EC2 instance?
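As a minimal sketch, an Ec2Resource definition with terminateAfter could look like the following. The id, instance type, and role names are placeholders, not values from the question:

    {
      "id": "BackupEc2Resource",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "1 Hour",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    }

With this set, the instance is terminated one hour after it starts, whether or not further activities are pending, so choose a window that covers your longest backup.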

Two: This thread on the AWS forum discusses how an existing EC2 instance may be used by Data Pipeline instead of a freshly launched one.
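The usual mechanism for this is a worker group: an activity names a workerGroup instead of a runsOn resource, and a Task Runner process you start on your own long-running instance with the same worker group string picks up the work. A hedged sketch, with a hypothetical id, command, and group name:

    {
      "id": "BackupActivity",
      "type": "ShellCommandActivity",
      "command": "echo run the backup step here",
      "workerGroup": "my-backup-workers"
    }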

Three: The backup pipeline template always creates a single pipeline with a single Activity that reads from a single source and writes to a single destination. You can view the JSON source of the pipeline in the AWS console and write a similar pipeline with multiple Activity instances, one for each table you want to back up. Since the pipeline definition will then contain only one EMR resource, that one resource does the work of all the activities.
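A rough sketch of such a consolidated definition, assuming two tables and omitting the data nodes and EMR step details the template would normally fill in (all ids are placeholders):

    {
      "objects": [
        {
          "id": "SharedEmrCluster",
          "type": "EmrCluster",
          "masterInstanceType": "m1.medium",
          "coreInstanceCount": "1",
          "terminateAfter": "2 Hours"
        },
        {
          "id": "BackupTableA",
          "type": "EmrActivity",
          "runsOn": { "ref": "SharedEmrCluster" },
          "input": { "ref": "TableASourceNode" },
          "output": { "ref": "TableABackupNode" }
        },
        {
          "id": "BackupTableB",
          "type": "EmrActivity",
          "runsOn": { "ref": "SharedEmrCluster" },
          "input": { "ref": "TableBSourceNode" },
          "output": { "ref": "TableBBackupNode" }
        }
      ]
    }

Because both activities share the runsOn reference, the cluster is launched once per scheduled run instead of once per table.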

AravindR

You can set the field maxActiveInstances on the Ec2Resource object.

maxActiveInstances: The maximum number of concurrent active instances of a component. For activities, setting this to 1 runs instances in strict chronological order. A value greater than 1 allows different instances of the activity to run concurrently, and requires you to ensure your activity can tolerate concurrent execution.

See this: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
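For example, a resource capped at one concurrent instance might be defined like this. The id and role names are placeholders:

    {
      "id": "BackupResource",
      "type": "Ec2Resource",
      "maxActiveInstances": "1",
      "instanceType": "t1.micro",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    }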
