ElasticSearch + Kibana - Unique count using pre-computed hashes


Update: added the ElasticSearch query and stack trace sections below.

I want to perform a unique count on my ElasticSearch cluster. The cluster contains about 50 million records.

I've tried the following methods:

First method

Mentioned in this section:

Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
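Kibana's unique count is backed by Elasticsearch's cardinality aggregation, so the pre-computed hash is only used when the aggregation targets the murmur3 sub-field. As a minimal sketch (the index name my_index and the aggregation name unique_my_prop are assumptions; my_prop.hash is the sub-field defined in the mapping below), the equivalent raw query would look roughly like this:

POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_my_prop": {
      "cardinality": {
        "field": "my_prop.hash",
        "precision_threshold": 100
      }
    }
  }
}

precision_threshold is optional; it trades memory for accuracy of the approximate count.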

Second method

Mentioned in this section:

Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.

My property mapping

"my_prop": {
  "index": "not_analyzed",
  "fielddata": {
    "format": "doc_values"
  },
  "doc_values": true,
  "type": "string",
  "fields": {
    "hash": {
      "type": "murmur3"
    }
  }
}

The problem

When I use a unique count on my_prop.hash in Kibana, I receive the following error:

Data too large, data for [my_prop.hash] would be larger than limit

ElasticSearch has a 2 GB heap. The above also fails for a single index with 4 million records.
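The "Data too large" message comes from the fielddata circuit breaker, which rejects requests that would push fielddata past a fixed percentage of the heap. As a sketch of how to inspect what the hash field is actually consuming (standard stats endpoints; the field name is taken from the mapping above):

GET /_nodes/stats/indices/fielddata?fields=my_prop.hash
GET /_cat/fielddata?v

Raising indices.breaker.fielddata.limit would only postpone the failure on a 2 GB heap; keeping the values off-heap with doc_values, as the answer below suggests, addresses the cause.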

My questions

  1. Am I missing something in my configuration?
  2. Should I scale up my machine? That does not seem like a scalable solution.

ElasticSearch query

The query was generated by Kibana: http://pastebin.com/hf1yNLhE

ElasticSearch Stack trace

http://pastebin.com/BFTYUsVg

1 Answer

Andrei Stefan (accepted answer):

That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of my_prop.hash, so you need to take them out of the heap and put them on disk, which means using doc_values.

Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (note that the settings of the main field are not inherited by its sub-fields): "hash": { "type": "murmur3", "index": "no", "doc_values": true }.
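Putting that together, a sketch of the corrected mapping might look like the following (same field names as in the question; since the murmur3 hash is only ever aggregated on, its indexing can be disabled):

"my_prop": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true,
  "fielddata": {
    "format": "doc_values"
  },
  "fields": {
    "hash": {
      "type": "murmur3",
      "index": "no",
      "doc_values": true
    }
  }
}

Note that doc_values cannot be switched on for an already-mapped field in place, so the index will most likely have to be recreated and the data reindexed for the change to take effect.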