I have an application where previously a document field was not required to be an array, such as
"tags" : "tag1"
Now the application requires the field to be an array, like so:
"tags" : ["tag1","tag2"]
There are currently 4.5M documents in Elasticsearch.
I wrote a bash script that updates 1000 documents, but it takes over 2 minutes, which means it would take over 8 days to run over all 4.5M documents. It seems like I'm doing something wrong. What is the best way to do this in Elasticsearch? Here is the bash script:
#!/bin/bash
echo "Starting"
IDS=$(curl -XGET 'http://elastichost/index/_search?size=1000' -d '{ "query" : {"match_all" : {}}, "fields": ["_id"]}' | grep -Po '"_id":.*?[^\\]",'| awk -F':' '{print $2}'| sed -e 's/^"//' -e 's/",$//')
#Create an array out of the IDS
array=($IDS)
#Loop through the IDS and update them
for i in "${!array[@]}"
do
echo "$i=>|${array[i]}|"
curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
{
"script" : "ctx._source.tags = [ctx._source.tags]"
}'
done
echo "\nFinished"
Add the "> /dev/null 2>&1 &" to you command, to ensure the process gets properly forked and doesn’t log anywhere.
The equivalent shell command looks like this:
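#Same update call as in the question's script, now backgrounded and silenced
curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
{
"script" : "ctx._source.tags = [ctx._source.tags]"
}' > /dev/null 2>&1 &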
It takes a little over 1 ms to fork the process, which then uses around 4k of resident memory, while the curl process itself takes the standard ~300 ms for the SSL handshake to make the request.
On my moderately sized machine, I can fork around 100 HTTPS curl requests per second without them stacking up in memory. Without SSL, it can do significantly more.
Do not echo anything in your terminal.
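Putting this together, the update loop from the question could be rewritten along these lines. This is only a sketch: it reuses the same host, index, type, and array of IDs as the script above, and the wait every 100 iterations is an added throttle (roughly matching the ~100 requests per second figure) so the forked curl processes don't pile up.
for i in "${!array[@]}"
do
#Fork the update into the background and discard its output
curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
{
"script" : "ctx._source.tags = [ctx._source.tags]"
}' > /dev/null 2>&1 &
#Every 100 forked requests, wait for them to finish before starting more
if (( (i + 1) % 100 == 0 )); then
wait
fi
done
#Wait for any remaining background requests
wait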
Reference: link