Elasticsearch reindex API - Unable to reindex a large number of documents


I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.

curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "old_host:9200"
    },
    "index": "*",
    "size": 10000,    
    "query": {
      "match_all": {}      
     }
  },
  "conflicts": "proceed",
  "dest": {
    "index": "logstash"
  }
}'

This fetches only the last 10,000 documents (one batch), and the request completes after that. However, I need to reindex more than a million documents. Is there a way to make the request run over all the matched documents? Can we set the number of batches in the request, or make it keep issuing batches until all documents are indexed?

One option I can think of is to send the request repeatedly, narrowing the query on a datetime field each time. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?

Accepted answer by ibexit:

Remove the query and size params in order to fetch all the data. If you need to filter to only the desired documents using a query, just remove the size so that all matching logs are fetched.

Using wait_for_completion=false as a query param will return the task id, and you will be able to monitor the reindex progress using GET /_tasks/<task_id>.
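
For example, a minimal sketch of monitoring and, if necessary, cancelling the task (<task_id> is a placeholder for the id returned by the reindex call):

# Check the progress of the reindex task
curl -X GET "new_host:9200/_tasks/<task_id>?pretty"

# Cancel the reindex task if something goes wrong
curl -X POST "new_host:9200/_tasks/<task_id>/_cancel?pretty"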

If you need or want to break the reindexing into several steps/chunks, consider using the slice feature, as sketched below.
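
A minimal sketch of manual slicing, splitting one reindex into two independent requests that can run in parallel (run it once per slice id, here 0 and 1). Note that reindexing from a remote cluster does not support slicing, so this form applies to reindex jobs running locally on the new cluster:

curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "index_name",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "conflicts": "proceed",
  "dest": {
    "index": "logstash"
  }
}'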

BTW: Reindex one index after another instead of all at once using *, and consider using daily/monthly indices, since that makes it easier to resume the process on errors and to manage log retention compared to one single large index.

In order to improve the speed, you should reduce the replicas to 0 and set refresh_interval=-1 on the destination index before reindexing, and reset the values afterwards.
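
A sketch of those settings changes, assuming the destination index logstash from above and one replica as the desired final state:

# Before reindexing: no replicas, no periodic refresh
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}'

# After reindexing: restore replicas and the default refresh interval
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}'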

curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "old_host:9200"
    },
    "index": "index_name"
  },
  "conflicts": "proceed",
  "dest": {
    "index": "logstash"
  }
}'

UPDATE based on comments:

While reindexing, there is at least one error that causes the reindexing to stop. It is caused by at least one document (id=xiB9...) having 'OK' as the value of the field 'fields.StatusCode', while the mapping in the destination index has long as the data type for that field, which triggers the mentioned exception.

One solution is to change the StatusCode of the source documents to 200, for example, but there will probably be more documents causing the very same error.

Another solution is to change the mapping in the destination index to the keyword type. That requires a handmade mapping, set before any data has been inserted, and possibly reindexing the data that is already present, as sketched below.
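
A minimal sketch of such a handmade mapping, created before any data is written (only the problematic field is shown; the remaining fields can still be mapped dynamically):

curl -X PUT "new_host:9200/logstash" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "fields": {
        "properties": {
          "StatusCode": {
            "type": "keyword"
          }
        }
      }
    }
  }
}'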