How to avoid index explosion in ElasticSearch

I have two documents in the same index that originally looked like this (only the _source is shown here):

{
    "id" : "3",
    "name": "Foo",
    "property":{
        "schemaId":"guid_of_the_RGB_schema_defined_extenally",
        "value":{
            "R":255,
            "G":100,
            "B":20
        }
    }
}
{
    "id" : "2",
    "name": "Bar",
    "property":{
        "schemaId":"guid_of_the_HSL_schema_defined_extenally",
        "value":{
            "H":255,
            "S":100,
            "L":20
        }
    }
}

The schema (used to validate value) is stored outside of ES, since it has nothing to do with indexing. If I don't define a mapping, the value field is dynamically mapped as an object, and its mapping gains a new subfield every time a document introduces a new key.

Elasticsearch currently supports the flattened field type https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html to prevent this mapping explosion. However, it has limited support for searching inner fields because of this documented restriction: As with queries, there is no special support for numerics: all values in the JSON object are treated as keywords. When sorting, this implies that values are compared lexicographically.
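
To illustrate why keyword comparison breaks a numeric range such as B in [10, 30], here is a minimal sketch in plain Python (no Elasticsearch involved). The sample values are made up, and Python string comparison stands in for keyword ordering, which matches for plain digits.

```python
# Hypothetical sample "B" values; not taken from a real index.
values = [5, 20, 100, 255]


def in_range_lexicographic(value, low, high):
    """Compare as strings, the way keyword values are ordered."""
    return str(low) <= str(value) <= str(high)


def in_range_numeric(value, low, high):
    """Compare as numbers, the way a long or float field is ordered."""
    return low <= value <= high


lexicographic_hits = [v for v in values if in_range_lexicographic(v, 10, 30)]
numeric_hits = [v for v in values if in_range_numeric(v, 10, 30)]

print("lexicographic:", lexicographic_hits)
print("numeric:", numeric_hits)
```

Compared as strings, "100" and "255" both sort between "10" and "30", so they would wrongly match the range, while the numeric comparison keeps only 20.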

I need to be able to query the index for documents whose values match a given condition (e.g. B in the range [10, 30]).

So far I have come up with a solution that structures my documents like this:

{
    "id":4,
    "name":"Boo",
    "property":
    {
        "guid_of_the_normalized_RGB_schema_defined_extenally":
        {
           "R":0.1,
           "G":0.2,
           "B":0.5
        }
    }
}
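
As a rough sketch of how the original layout (schemaId plus value) could be reshaped into this keyed layout before indexing, the Python below nests value under its schemaId. The helper name key_by_schema is hypothetical and the document is the "Foo" example from above.

```python
import copy


def key_by_schema(doc):
    """Return a copy of doc with property.value nested under its schemaId."""
    reshaped = copy.deepcopy(doc)  # leave the caller's document untouched
    prop = reshaped["property"]
    reshaped["property"] = {prop["schemaId"]: prop["value"]}
    return reshaped


original = {
    "id": "3",
    "name": "Foo",
    "property": {
        "schemaId": "guid_of_the_RGB_schema_defined_extenally",
        "value": {"R": 255, "G": 100, "B": 20},
    },
}

reshaped = key_by_schema(original)
print(reshaped)
```

Each schema then gets its own object under property, so R, G and B from different schemas no longer share one field path.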

Although this layout does not solve the mapping explosion, it mitigates another issue. The mapping for the property field now looks similar to this:

"property": {
        "properties": {
          "guid_of_the_RGB_schema_defined_extenally": {
            "properties": {
              "B": {
                "type": "long"
              },
              "G": {
                "type": "long"
              },
              "R": {
                "type": "long"
              }
            }
          },
          "guid_of_the_normalized_RGB_schema_defined_extenally": {
            "properties": {
              "B": {
                "type": "float"
              },
              "G": {
                "type": "float"
              },
              "R": {
                "type": "float"
              }
            }
          },
          "guid_of_the_HSL_schema_defined_extenally": {
            "properties": {
              "H": {
                "type": "long"
              },
              "L": {
                "type": "long"
              },
              "S": {
                "type": "long"
              }
            }
          }
        }
      }

This solves the case where fields have the same name but different data types.

Can someone suggest a solution that avoids the mapping explosion without suffering from the search limitations of the flattened type?

There is 1 answer

Jaycreation (best answer)

To avoid a mapping explosion, the best solution is to normalize your data better. You can set "dynamic": "strict" in your mapping; a document will then be rejected if it contains a field that is not already in the mapping. You can still add new fields afterwards, but you will have to add them to the mapping explicitly first.
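
A minimal sketch of what that could look like, written as Python dicts that would be sent as JSON request bodies. The index name myindex, the field types for id and name, and the helper schema_properties are assumptions for illustration, not taken from the question.

```python
import json


def schema_properties(field_types):
    """Turn {"R": "long", ...} into an Elasticsearch "properties" block."""
    return {"properties": {name: {"type": t} for name, t in field_types.items()}}


# Body for creating the index, e.g. sent with: PUT myindex
create_index_body = {
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "id": {"type": "keyword"},  # assumed type
            "name": {"type": "text"},   # assumed type
            "property": {
                "properties": {
                    "guid_of_the_RGB_schema_defined_extenally": schema_properties(
                        {"R": "long", "G": "long", "B": "long"}
                    ),
                }
            },
        },
    }
}

# Body for registering a new schema later, e.g. sent with: PUT myindex/_mapping
add_hsl_body = {
    "properties": {
        "property": {
            "properties": {
                "guid_of_the_HSL_schema_defined_extenally": schema_properties(
                    {"H": "long", "S": "long", "L": "long"}
                ),
            }
        }
    }
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(add_hsl_body, indent=2))
```

With this in place, a document that uses an unregistered schema GUID is rejected instead of silently growing the mapping.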

You can also add an ingest pipeline to clean up and normalize your data before it is indexed.

If you don't want to reindex, or cannot:

To keep your query simple even when you don't know the "middle" part of the key, you can use a multi_match query with a wildcard (*) in the field name.

GET myindex/_search
{
  "query": {
    "multi_match": {
      "query": 0.5,
      "fields": ["property.*.B"]
    }
  }
}

However, you still won't be able to sort the results the way you want. To order on multiple 'unknown' field names without touching the data, you can use a script: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html

But maybe you could simplify the whole process by adding a dynamic template to your index.

PUT test/_mapping
{
  "dynamic_templates": [
    {
      "unified_red": {
        "path_match": "property.*.R",
        "mapping": {
          "type": "float",
          "copy_to": "unified_color.R"
        }
      }
    },
    {
      "unified_green": {
        "path_match": "property.*.G",
        "mapping": {
          "type": "float",
          "copy_to": "unified_color.G"
        }
      }
    },
    {
      "unified_blue": {
        "path_match": "property.*.B",
        "mapping": {
          "type": "float",
          "copy_to": "unified_color.B"
        }
      }
    }
  ],
  "properties": {
    "unified_color": {
      "properties": {
        "R": {
          "type": "float"
        },
        "G": {
          "type": "float"
        },
        "B": {
          "type": "float"
        }
      }
    }
  }
}

Then you'll be able to query any value with the same query:

GET test/_search
{
  "query": {
    "range": {
      "unified_color.B": {
        "gte": 0.1,
        "lte": 0.6
      }
    }
  }
}

For fields that already exist, the dynamic template does not apply, so you'll have to add copy_to to their mapping yourself and then run an _update_by_query to populate the unified fields.
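
A rough sketch of those two steps, expressed as Python data rather than live calls. It assumes your Elasticsearch version accepts adding copy_to to an existing field through the update mapping API; the helper with_copy_to and the requests_to_send list are hypothetical names used only for illustration.

```python
import json


def with_copy_to(schema_id, channel, field_type):
    """Mapping update adding copy_to to one existing property.<schema>.<channel> field."""
    return {
        "properties": {
            "property": {
                "properties": {
                    schema_id: {
                        "properties": {
                            channel: {
                                "type": field_type,  # repeat the field's existing type
                                "copy_to": "unified_color." + channel,
                            }
                        }
                    }
                }
            }
        }
    }


mapping_update = with_copy_to("guid_of_the_RGB_schema_defined_extenally", "B", "long")

# Step 1 updates the mapping; step 2 re-indexes documents in place so that
# copy_to is applied to documents indexed before the change.
requests_to_send = [
    ("PUT", "test/_mapping", mapping_update),
    ("POST", "test/_update_by_query?conflicts=proceed", None),
]

for method, path, body in requests_to_send:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Running _update_by_query without a query body touches every document in the index, so on a large index it may be worth restricting it to the documents that actually contain the affected fields.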