ElasticSearch update all documents with bulk API and script


I have an application where previously a document field was not required to be an array, for example:

"tags" : "tag1"

Now the application requires the field to be an array, like so:

"tags" : ["tag1","tag2"]

There are currently 4.5M documents in ElasticSearch.

So I wrote a bash script that updates 1000 documents, but it takes over 2 minutes per batch, which means it would take over 8 days to run through all 4.5M documents. It seems like I'm doing something wrong. What is the best way to do this in ElasticSearch? Here is the bash script:

#!/bin/bash
echo "Starting"
# Fetch the IDs of 1000 documents and strip them out of the JSON response
IDS=$(curl -XGET 'http://elastichost/index/_search?size=1000' -d '{ "query" : {"match_all" : {}}, "fields" : ["_id"]}' | grep -Po '"_id":.*?[^\\]",' | awk -F':' '{print $2}' | sed -e 's/^"//' -e 's/",$//')
# Create an array out of the IDs
array=($IDS)
# Loop through the IDs and wrap each document's existing tags value in an array
for i in "${!array[@]}"
do
    echo "$i=>|${array[i]}|"
    curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
      {
        "script" : "ctx._source.tags = [ctx._source.tags]"
      }'
done
echo -e "\nFinished"

1 Answer

Answer from Noproblem:

Add "> /dev/null 2>&1 &" to your command so that each curl process is forked into the background, discards its output, and does not log anything to the terminal.

The resulting shell command looks like this:

    curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
      {
        "script" : "ctx._source.tags = [ctx._source.tags]"
      }' > /dev/null 2>&1 &

Forking the process takes a little over 1 ms, and the forked process uses around 4 KB of resident memory, while the curl request itself takes the usual ~300 ms of SSL overhead to complete.

On my moderately sized machine, I can fork around 100 HTTPS curl requests per second without them stacking up in memory. Without SSL, it can do significantly more:

  • Forking a process without waiting for its output is fast.
  • curl takes about the same time to make a request as a raw socket would, but the work happens out of band.
  • Forking curl requires only standard Unix primitives.
  • Forking sets a single request back only a few milliseconds, but many concurrent forks will start to slow your servers (a throttling sketch follows this list).
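
If the background requests do start to pile up, one option is to wait for each batch to finish before launching more. Below is a rough sketch of that idea, reusing the array variable from the question's script and assuming a batch size of 100 based purely on the throughput figure above.

    # Sketch only: fork the updates, but pause every 100 forks so the
    # background curls cannot pile up without bound (100 is an assumed batch size)
    count=0
    for i in "${!array[@]}"
    do
        curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
          {
            "script" : "ctx._source.tags = [ctx._source.tags]"
          }' > /dev/null 2>&1 &
        count=$((count + 1))
        if [ $((count % 100)) -eq 0 ]; then
            wait   # block until the current batch of background requests finishes
        fi
    done
    wait   # catch the final partial batch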

Also, do not echo anything to your terminal inside the loop; printing for every document only slows the script down.

Reference: link