How to get the same _id in Elasticsearch when the same document is ingested at two different times

Let's say I have 50 documents I want to ingest into an index. So I do this and I have 50 documents I can retrieve if I were to query Elasticsearch.

At a later time, perhaps through an automated process, these same 50 documents end up getting ingested again. I run a query and see pairs of documents that are identical in everything except the _id value. The two 20-character _id values in a pair differ in characters 0-1 and 16-19, but characters 2-15 are exactly the same. I assume these _id values are autogenerated; maybe the first two characters are some sort of sequence number?

But how would I go about having the document, each time, map to the same _id?

I expect each unique document to map to the same _id value so that my index is not filled up with the exact same information multiple times.

There is 1 answer

Tim On

You can use the fingerprint ingest processor to calculate an _id field from the fields in your document.

You'll need to decide which fields to use, that is, what makes a document "unique": is it all fields, or are there specific identifier fields you want to use?
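To see why the choice of fields matters, here is a minimal Python sketch of the general idea of deriving an ID by hashing selected fields. It is only an illustration: it does not reproduce the fingerprint processor's exact algorithm, so the values will not match what Elasticsearch computes. The helper `fingerprint_id` and the `ingested_at` field are hypothetical names made up for this example.

```python
import hashlib
import json


def fingerprint_id(doc, fields):
    """Return a deterministic hex ID derived from the chosen fields of doc.

    Hypothetical helper for illustration; not the Elasticsearch algorithm.
    """
    # Keep only the fields that define uniqueness; missing fields are skipped.
    selected = {name: doc[name] for name in sorted(fields) if name in doc}
    # Serialize with sorted keys so key order never changes the result.
    payload = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Two copies of the same record, ingested at different times.
# "ingested_at" is an assumed extra field that changes on every run.
first = {"field1": "one", "field2": "2", "field3": 3, "ingested_at": "2024-01-01"}
second = {"field1": "one", "field2": "2", "field3": 3, "ingested_at": "2024-02-01"}

key_fields = ["field1", "field2", "field3"]

# Hashing only the identifying fields gives both copies the same ID.
same_id = fingerprint_id(first, key_fields) == fingerprint_id(second, key_fields)

# Hashing every field, including the changing one, gives different IDs.
all_fields_differ = fingerprint_id(first, list(first)) != fingerprint_id(second, list(second))

print(same_id, all_fields_differ)
```

If any field changes between runs (a timestamp, for instance), leaving it out of the fingerprint is what keeps the ID stable.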

Then you define an ingest pipeline such as:

PUT _ingest/pipeline/auto-id
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["field1", "field2", "field3"],
        "target_field": "_id"
      }
    }
  ]
}

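When you ingest documents, specify that you want to use this pipeline:

POST my-index/_doc?pipeline=auto-id
{
  "field1": "one",
  "field2": "2",
  "field3": 3
}

Ingesting the same document again then produces the same _id, so it overwrites the existing document instead of creating a duplicate.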
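For a batch such as the 50 documents in the question, the same pipeline can be applied through the bulk API, which also accepts a `pipeline` query parameter. The sketch below only builds the newline-delimited request body; the helper `build_bulk_body` is a hypothetical name, and the sketch assumes the pipeline is called `auto-id` as defined above and that the body is sent with whatever HTTP client you already use.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON body for the _bulk API.

    Hypothetical helper: emits one action line followed by one source
    line for each document.
    """
    lines = []
    for doc in docs:
        # Empty "index" action: no _id is given here, because the
        # ingest pipeline is expected to set it.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"field1": "one", "field2": "2", "field3": 3},
    {"field1": "two", "field2": "4", "field3": 5},
]
body = build_bulk_body(docs)

# Send `body` to: POST my-index/_bulk?pipeline=auto-id
# with the header Content-Type: application/x-ndjson
print(body)
```

Because no _id appears in the action lines, the automated process does not need to know the IDs in advance; it can resend the same batch and rely on the pipeline to map each document to the same _id.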