Find documents in Elasticsearch where `ignore_malformed` was triggered


By default, Elasticsearch throws an exception when a document is indexed with a value that does not fit the field's existing type. For example, if a field was created with a numeric type, indexing a document with a non-numeric string value for that field causes an error.

This behavior can be changed by enabling the ignore_malformed setting, which means such fields are silently ignored for indexing purposes but retained in the _source document. The invalid values cannot be searched or aggregated, but they are still included in the returned document.

This is the preferable behavior in our use case, but we would like to be able to locate such documents somehow so we can fix them in the future.

Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.


There are 2 answers

1
alr On

You can use the exists query to find documents where this field has no indexed value; see this example:

PUT foo
{
  "mappings": {
    "bar": {
      "properties": {
        "baz": {
          "type": "integer",
          "ignore_malformed": true
        }
      }
    }
  }
}

PUT foo/bar/1
{
  "baz": "field"
}

GET foo/bar/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "exists": {
                "field": "baz"
              }
            }
          ]
        }
      }
    }
  }
}

There is no dedicated mechanism, though, so this search also finds documents where the field was intentionally left unset.
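Since the ignored value is still kept in _source, the hits of the search above can be split client-side into "field really absent" and "value present but not indexed". Here is a minimal sketch, assuming the hits are plain dicts with `_id` and `_source` keys (the shape of a search response) and that the field is a top-level one. The function name `split_missing_vs_malformed` is made up for this illustration.

```python
def split_missing_vs_malformed(hits, field):
    """Split hits of a must_not/exists search into two lists of IDs.

    missing:   the field is absent, null, or an array holding only nulls
               (the exists query treats all of these as "no value").
    malformed: _source holds a real value, so it was presumably dropped
               at index time by ignore_malformed.
    """
    missing, malformed = [], []
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        is_empty_list = isinstance(value, list) and all(v is None for v in value)
        if value is None or is_empty_list:
            missing.append(hit["_id"])
        else:
            malformed.append(hit["_id"])
    return missing, malformed


# Hand-written sample hits, not real cluster output.
hits = [
    {"_id": "1", "_source": {"baz": "field"}},  # string in an integer field
    {"_id": "2", "_source": {"other": 1}},      # baz never set
    {"_id": "3", "_source": {"baz": None}},     # baz explicitly null
]
missing, malformed = split_missing_vs_malformed(hits, "baz")
print(missing, malformed)
```

This only works if _source is returned with the hits, and it assumes that nothing other than ignore_malformed keeps the field out of the index.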

0
Thomas Decaux On

You cannot. When you search in Elasticsearch, you don't search the document source but the inverted index, which contains the analyzed data.

The ignore_malformed flag says "always store the document, analyze if possible".

You can try it: create a malformed document and use the _termvectors API to see how the document is analyzed and stored in the inverted index. In the case of a string field, you can see that an "Array" is stored as an empty string, etc., but the field will exist.

So forget the inverted index, let's use the source!

  1. Scroll through all your data until you find the anomaly. I use a small Python script that scrolls through the search results, deserializes each hit, and tests the field type for every document. It takes a very long time, but it gives me a list of wrong document IDs.

  2. Use a script query. It can run for a very long time and crash your cluster, so use it with caution, maybe as a post_filter:

Here I want to retrieve the documents where country_name is not a string:

{
  "_source": false,
  "timeout": "30s",
  "query": {
    "query_string": {
      "query": "locale:de_ch"
    }
  },
  "post_filter": {
    "script": {
      "script": "!(_source.country_name instanceof String)"
    }
  }
}
  • "_source": false => I only want the document IDs
  • "timeout" => prevents a crash
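Option 1 above (scrolling and testing the field type in the source) could look roughly like this. It is only a sketch of the type test, not my actual script: the function names are invented here, and the hits are assumed to be dicts with `_id` and `_source`, for example as produced by a scroll or by the scan helper of the official Python client.

```python
def has_expected_type(value, expected_type):
    """Type test for a deserialized _source value."""
    # bool is a subclass of int in Python, so reject it for non-bool checks.
    if isinstance(value, bool) and expected_type is not bool:
        return False
    return isinstance(value, expected_type)


def find_wrong_type_ids(hits, field, expected_type):
    """Yield the IDs of hits whose `field` is set but has the wrong type."""
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        if value is None:
            # Assumption of this sketch: absent or null fields are not reported.
            continue
        if not has_expected_type(value, expected_type):
            yield hit["_id"]


# Hand-written sample hits standing in for a scroll over the index.
sample_hits = [
    {"_id": "a", "_source": {"country_name": "Switzerland"}},
    {"_id": "b", "_source": {"country_name": ["Switzerland"]}},
    {"_id": "c", "_source": {"locale": "de_ch"}},
]
wrong_ids = list(find_wrong_type_ids(sample_hits, "country_name", str))
print(wrong_ids)
```

Because it is a generator, it can consume hits lazily while scrolling instead of loading the whole index into memory.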

As you can see, this is a missing feature. I know Logstash tags documents that fail, so Elasticsearch could implement the same thing.
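Since the asker controls the insertion process, a similar tag can be added on the client before the document is sent. The following is a rough sketch under stated assumptions: `tag_malformed`, the `expected_types` dict, and the `malformed_fields` field are all hypothetical names, and the Python type test only approximates what Elasticsearch accepts (for instance, Elasticsearch coerces numeric strings by default, which this check would flag).

```python
def tag_malformed(doc, expected_types, tag_field="malformed_fields"):
    """Return a copy of `doc` with a list of suspicious field names added.

    expected_types maps field name -> Python type, maintained by hand to
    mirror the index mapping (an assumption of this sketch).
    """
    bad = []
    for name, expected in expected_types.items():
        value = doc.get(name)
        if value is None:
            continue  # nothing to validate
        # bool is a subclass of int, so treat it as wrong for non-bool fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            bad.append(name)
    tagged = dict(doc)
    if bad:
        tagged[tag_field] = sorted(bad)
    return tagged


tagged = tag_malformed({"baz": "field", "qux": 3}, {"baz": int, "qux": int})
clean = tag_malformed({"baz": 5}, {"baz": int})
print(tagged)
print(clean)
```

If the tag field is mapped as a keyword, flagged documents can later be found with an ordinary exists or term query on it.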