How to add pre-existing data from DynamoDB to Elasticsearch?

I set up Elasticsearch Service and a DynamoDB stream as described in this blog post. Now I need to add pre-existing data from DynamoDB to Elasticsearch.

I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it.

What is the best option in this case for adding pre-existing data?

There are 2 answers

Answer by Tony V. (best answer)

This post describes how to add pre-existing data from DynamoDB to Elasticsearch.

Answer by best wishes

Populating Elasticsearch with existing items is not straightforward, because a DynamoDB stream only emits records for item changes, not for records that already exist.

Here are a few approaches, with pros and cons:

  1. Scan all the existing items from DynamoDB and send them to Elasticsearch

    We can run a Python script on an EC2 machine that scans all the existing items and sends the data to ES.

    Pros:

    a. Simple solution, nothing much required.

    Cons:

    a. Cannot be run in a Lambda function, since the job may time out if there are too many records.

    b. This approach is more of a one-time thing and cannot be used for incremental changes (let's say we want to keep updating ES as the DynamoDB data changes).
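
A minimal sketch of this approach, under a few assumptions: `table` behaves like a boto3 DynamoDB `Table` resource, and `send_bulk` is a hypothetical callable that POSTs a request body to the domain's `_bulk` endpoint with whatever authentication the domain requires. The index name and id attribute are placeholders. Older Elasticsearch versions may also expect a `_type` in each action line.

```python
import json
from decimal import Decimal


def scan_all_items(table):
    """Yield every item in the table, following Scan pagination."""
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        # A single Scan call returns at most one page; continue from here.
        kwargs["ExclusiveStartKey"] = last_key


def _json_default(value):
    # boto3 returns DynamoDB numbers as Decimal and sets as Python sets,
    # neither of which the json module can encode on its own.
    if isinstance(value, Decimal):
        return int(value) if value % 1 == 0 else float(value)
    if isinstance(value, set):
        return list(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def to_bulk_body(items, index_name, id_attr):
    """Build a newline-delimited JSON body for the Elasticsearch bulk API."""
    lines = []
    for item in items:
        action = {"index": {"_index": index_name, "_id": str(item[id_attr])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(item, default=_json_default))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def index_table(table, send_bulk, index_name, id_attr, batch_size=500):
    """Scan the table and hand bulk bodies to send_bulk in batches."""
    batch, total = [], 0
    for item in scan_all_items(table):
        batch.append(item)
        if len(batch) >= batch_size:
            send_bulk(to_bulk_body(batch, index_name, id_attr))
            total += len(batch)
            batch = []
    if batch:
        send_bulk(to_bulk_body(batch, index_name, id_attr))
        total += len(batch)
    return total
```

With boto3 installed, `table` could be obtained with `boto3.resource("dynamodb").Table("my-table")`, where `my-table` is a placeholder name. Using the document's key as `_id` keeps the job safe to re-run, since indexing the same item again overwrites the existing document instead of duplicating it.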

  2. Use DynamoDB streams

    We can enable DynamoDB streams and build the pipeline as explained here. Then we can update some flag on the existing items so that all the records flow through the pipeline and the data goes to ES.

    Pros:

    a. The pipeline can be used for incremental DynamoDB changes.

    b. No code duplication or one-time effort. Every time we need to update an item in ES, we update the item in DynamoDB and it gets indexed in ES.

    c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)

    Cons:

    a. Changing prod data can be dangerous and may not be allowed, depending on the use case.
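
The "update some flag" step could look roughly like the sketch below. It assumes `table` behaves like a boto3 DynamoDB `Table` resource; the attribute name `reindex_marker` and the key names passed in are hypothetical. Note that DynamoDB Streams does not write a record for an update that leaves the item unchanged, so the marker has to be a value the items do not already hold (a timestamp here).

```python
import time


def touch_all_items(table, key_names, flag_attr="reindex_marker", marker=None):
    """Set a marker attribute on every item so each one emits a stream record.

    key_names lists the table's primary key attributes, e.g. ["pk", "sk"].
    """
    if marker is None:
        marker = int(time.time())

    # Placeholders avoid clashes between key names and DynamoDB reserved words.
    names = {f"#k{i}": name for i, name in enumerate(key_names)}
    scan_kwargs = {
        # Only the key attributes are needed to address each item.
        "ProjectionExpression": ", ".join(names),
        "ExpressionAttributeNames": names,
    }

    touched = 0
    while True:
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            table.update_item(
                Key={name: item[name] for name in key_names},
                UpdateExpression="SET #flag = :marker",
                ExpressionAttributeNames={"#flag": flag_attr},
                ExpressionAttributeValues={":marker": marker},
            )
            touched += 1
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        scan_kwargs["ExclusiveStartKey"] = last_key
    return touched
```

Each `update_item` call consumes write capacity, so on a large table it may be worth throttling the loop or running it during a quiet period.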

  3. A slight modification of the above approach

    Instead of changing items in the prod table, we can create a temporary table and enable a stream on it, reusing the pipeline from the 2nd approach. Then we copy the items from the prod table to the temporary table; the data will flow through the existing pipeline and get indexed in ES.

    Pros:

    a. No Prod data change is required and this pipeline can be used for incremental changes as well.

    b. Otherwise the same pros as approach 2.

    Cons:

    a. Copying data from one table to another may take a lot of time, depending on the data size.

    b. Copying data from one table to another requires a one-time script, and hence has maintainability issues.
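
The copy step might be sketched as follows, assuming both arguments behave like boto3 DynamoDB `Table` resources and that the temporary table already exists with the same key schema and a stream enabled. The function name is illustrative only.

```python
def copy_items(source_table, target_table):
    """Copy every item from source_table into target_table.

    Each write to target_table emits a record on that table's stream,
    which the existing pipeline can then index.
    """
    copied = 0
    scan_kwargs = {}
    # batch_writer groups puts into batch write requests and resends
    # unprocessed items; pending writes are flushed when the block exits.
    with target_table.batch_writer() as writer:
        while True:
            page = source_table.scan(**scan_kwargs)
            for item in page.get("Items", []):
                writer.put_item(Item=item)
                copied += 1
            last_key = page.get("LastEvaluatedKey")
            if last_key is None:
                break
            scan_kwargs["ExclusiveStartKey"] = last_key
    return copied
```

The returned count can be compared against the number of documents in the index afterwards as a rough completeness check.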

Feel free to edit or suggest other approaches in the comments.