Distinct count is greater than doc_count in Elasticsearch aggs


I wrote an aggs query to get the total (sum) and the unique count, but the result is confusing.

The unique value is greater than doc_count.
Is that even possible?

I know that the cardinality aggregation is experimental and returns an approximate count of distinct values:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

But this result is far too inaccurate: as you can see, there are many buckets whose unique count is larger than doc_count.
Is there a problem with my request format, or is this a limitation of cardinality?
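
One thing I could not rule out at first: if the UUID field were mapped as an analyzed string, cardinality would count distinct tokens rather than distinct UUIDs, and per-token counts can exceed doc_count. A sketch of the mapping check (my_index and my_type stand in for my real index and type names):

GET /my_index/_mapping/my_type/field/UUID

For exact-value counting, the field should come back mapped as something like "type": "string", "index": "not_analyzed".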

Half a million documents are indexed, there are 15 types of eventID, and I am using ES 1.4.

request

{
    "size": 0,
    "_source": false,
    "aggs": {
        "eventIds": {
            "terms": {
                "field": "_EventID_",
                "size": 0
            },
            "aggs": {
                "unique": {
                    "cardinality": {
                        "field": "UUID"
                    }
                }
            }
        }
    }
}

response

{
    "took": 383,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 550971,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "eventIds": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 165110,
                    "unique": {
                        "value": 27423
                    }
                },
                {
                    "key": "blue",
                    "doc_count": 108376,
                    "unique": {
                        "value": 94775
                    }
                },
                {
                    "key": "yellow",
                    "doc_count": 78919,
                    "unique": {
                        "value": 70094
                    }
                },
                {
                    "key": "green",
                    "doc_count": 60580,
                    "unique": {
                        "value": 78945
                    }
                },
                {
                    "key": "black",
                    "doc_count": 49923,
                    "unique": {
                        "value": 56200
                    }
                },
                {
                    "key": "white",
                    "doc_count": 38744,
                    "unique": {
                        "value": 45229
                    }
                },
                ...

EDIT: more testing

I tried once more with a precision_threshold of 10,000, filtered to only one eventId, but the error is the same. I expected a cardinality of less than 30,000, but it is over 66,000 (which is even greater than the number of matching documents).

doc_count: 65,672 (no problem, right?)
cardinality: 66,037 (greater than doc_count)
actual cardinality: about 23,000 (calculated by RDBMS scripts...)

request

{
    "size": 0,
    "_source": false,
    "query": {
        "term": {
            "_EventID_": "packdownload"
        }
    },
    "aggs": {
        "unique": {
            "cardinality": {
                "field": "UUID",
                "precision_threshold": 10000
            }
        }
    }
}

response

{
    "took": 28,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 65672,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "unique": {
            "value": 66037
        }
    }
}
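
For cross-checking inside ES itself (instead of my RDBMS scripts), the exact number of distinct indexed terms can be obtained by running a terms aggregation on UUID with "size": 0 and counting the buckets in the response. This is only a sketch for verification, since it can be memory-hungry on high-cardinality fields, and it counts indexed terms, so it is subject to the same mapping caveat as cardinality:

{
    "size": 0,
    "query": {
        "term": {
            "_EventID_": "packdownload"
        }
    },
    "aggs": {
        "uuids": {
            "terms": {
                "field": "UUID",
                "size": 0
            }
        }
    }
}

The length of aggregations.uuids.buckets should match the true distinct count (about 23,000 here).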

1 Answer

Answered by Ishant Barnwal:

The highest value for precision_threshold is 40,000. That should slightly improve the results, but with that large a number of distinct values there can still be an error of around plus or minus 20%. This can happen even with smaller counts.
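
As a sketch, here is the question's second request with the threshold raised to that maximum (everything else unchanged):

{
    "size": 0,
    "query": {
        "term": {
            "_EventID_": "packdownload"
        }
    },
    "aggs": {
        "unique": {
            "cardinality": {
                "field": "UUID",
                "precision_threshold": 40000
            }
        }
    }
}

Counts below the configured threshold are expected to be close to exact, so with an actual cardinality of about 23,000 this should land in the near-exact range; above the threshold, the error grows.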