I want to perform a unique count on my ElasticSearch cluster. The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
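In practice this means running a cardinality aggregation against the pre-computed hash sub-field, roughly like the following (my_index is a placeholder; the field name matches my mapping below):

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "my_prop_unique_count": {
      "cardinality": {
        "field": "my_prop.hash"
      }
    }
  }
}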
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I run a unique count on my_prop.hash in Kibana, I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
ElasticSearch has a 2g heap size. The above also fails for a single index with 4 million records.
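For reference, per-field fielddata memory usage can be inspected with the _cat/fielddata API (the field list here is just the two fields from my mapping):

GET /_cat/fielddata?v&fields=my_prop,my_prop.hash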
My questions
- Am I missing something in my configuration?
- Should I move to a bigger machine? This does not seem like a scalable solution.
ElasticSearch query
Generated by Kibana: http://pastebin.com/hf1yNLhE
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values from hash, so you need to take them off the heap and put them on disk, meaning using doc_values.

Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and, no, the settings from the main field are not inherited by the sub-fields): "hash": { "type": "murmur3", "index": "no", "doc_values": true }.
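Put together with your existing mapping, the property would look roughly like this (same field names as above; only the sub-field settings change, and existing data needs to be reindexed for the new doc_values to take effect):

"my_prop": {
  "index": "not_analyzed",
  "fielddata": {
    "format": "doc_values"
  },
  "doc_values": true,
  "type": "string",
  "fields": {
    "hash": {
      "type": "murmur3",
      "index": "no",
      "doc_values": true
    }
  }
}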