I wrote an aggregation query to get the total (sum) and the unique count per bucket, but the result is confusing.
In some buckets the unique value is greater than doc_count.
Is that possible?
I know that the cardinality aggregation is experimental and returns only an approximate count of distinct values:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
But this result is too far off.
As you can see, there are many buckets whose unique value is larger than doc_count.
Is there a problem with the request format, or is this a limitation of the cardinality aggregation?
About half a million documents are indexed,
and there are 15 distinct eventID values.
I am using ES 1.4.
request
{
  "size": 0,
  "_source": false,
  "aggs": {
    "eventIds": {
      "terms": {
        "field": "_EventID_",
        "size": 0
      },
      "aggs": {
        "unique": {
          "cardinality": {
            "field": "UUID"
          }
        }
      }
    }
  }
}
response
{
  "took": 383,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 550971,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "eventIds": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "red",
          "doc_count": 165110,
          "unique": {
            "value": 27423
          }
        },
        {
          "key": "blue",
          "doc_count": 108376,
          "unique": {
            "value": 94775
          }
        },
        {
          "key": "yellow",
          "doc_count": 78919,
          "unique": {
            "value": 70094
          }
        },
        {
          "key": "green",
          "doc_count": 60580,
          "unique": {
            "value": 78945
          }
        },
        {
          "key": "black",
          "doc_count": 49923,
          "unique": {
            "value": 56200
          }
        },
        {
          "key": "white",
          "doc_count": 38744,
          "unique": {
            "value": 45229
          }
        },
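A quick way to see which buckets are affected is to compare each bucket's unique value with its doc_count. Below is a minimal Python sketch using only the six buckets visible in the response above. It assumes every document holds a single UUID value, so a correct distinct count could never exceed doc_count. The names `buckets` and `suspicious` are illustrative, not part of any Elasticsearch API.

```python
# Buckets copied from the response above as (key, doc_count, unique).
buckets = [
    ("red", 165110, 27423),
    ("blue", 108376, 94775),
    ("yellow", 78919, 70094),
    ("green", 60580, 78945),
    ("black", 49923, 56200),
    ("white", 38744, 45229),
]

# Assumption: one UUID value per document, so unique <= doc_count must hold.
# Collect every bucket that violates this, with the size of the overshoot.
suspicious = [
    (key, unique - doc_count)
    for key, doc_count, unique in buckets
    if unique > doc_count
]

for key, excess in suspicious:
    print(f"{key}: unique exceeds doc_count by {excess}")
```

With the numbers shown, the buckets flagged are green, black, and white.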
EDIT: more tests
I tried once more with a precision_threshold of 10,000, filtering on a single eventId,
but the error is the same.
The expected cardinality is less than 30,000, but the result is over 66,000 (greater than the total number of matching documents).
doc_count: 65,672 (correct)
cardinality: 66,037 (greater than doc_count)
actual cardinality: about 23,000 (calculated with RDBMS scripts)
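To put the size of this discrepancy in numbers, here is a small Python sketch of the arithmetic. It uses only the figures quoted above. The "actual" value is the approximate 23,000 from the RDBMS check, so the resulting percentage is itself only approximate, and the variable names are illustrative.

```python
doc_count = 65672       # hits.total for the filtered query
estimated = 66037       # value reported by the cardinality aggregation
actual_approx = 23000   # "about 23,000" from the RDBMS check (approximate)

# How far the estimate overshoots the number of matching documents.
excess_over_docs = estimated - doc_count

# Relative error of the estimate against the approximate true cardinality.
relative_error = (estimated - actual_approx) / actual_approx

print(f"estimate exceeds doc_count by {excess_over_docs}")
print(f"relative error vs. actual: {relative_error:.0%}")
```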
request
{
  "size": 0,
  "_source": false,
  "query": {
    "term": {
      "_EventID_": "packdownload"
    }
  },
  "aggs": {
    "unique": {
      "cardinality": {
        "field": "UUID",
        "precision_threshold": 10000
      }
    }
  }
}
response
{
  "took": 28,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 65672,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique": {
      "value": 66037
    }
  }
}
The highest supported value for precision_threshold is 40,000. Raising it to that should improve the results slightly, but with this many distinct values the estimate can still be off by around 20% in either direction. Errors like this can occur even with smaller counts.
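As a sketch of what that change looks like, the Python snippet below builds the filtered request from the question with precision_threshold raised to 40,000 and prints it as JSON. The field names and the term filter are taken from the question. The constant name `MAX_PRECISION_THRESHOLD` is illustrative, and the snippet only constructs the request body, it does not send it to a cluster.

```python
import json

# Highest precision_threshold mentioned above; the name is illustrative.
MAX_PRECISION_THRESHOLD = 40000

# Same filtered request as in the question, with only the threshold changed.
body = {
    "size": 0,
    "_source": False,
    "query": {"term": {"_EventID_": "packdownload"}},
    "aggs": {
        "unique": {
            "cardinality": {
                "field": "UUID",
                "precision_threshold": MAX_PRECISION_THRESHOLD,
            }
        }
    },
}

# Print the JSON body so it can be pasted into any HTTP client.
print(json.dumps(body, indent=2))
```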