distinct count on hive does not match cardinality count on elasticsearch

808 views Asked by At

I have loaded data into my elasticsearch cluster from hive using the elasticsearch-hadoop plugin from elastic.

I need to fetch a count of unique account numbers. I have the following queries written in both hql and queryDSL, BUT they are returning different counts.

Hive Query:

select count(distinct account) from <tableName> where capacity="550";

// Returns --> 71132

Similarly, in Elasticsearch the query looks like this:

{
    "query": {
        "bool": {
            "must": [
              {"match": { "capacity": "550"}}
            ]
        }
    },
    "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}

// Returns --> 71607

Am I doing something wrong? What can I do to match the two queries?

Note: There are exactly the same number of records in hive and elasticsearch.

2

There are 2 answers

3
David דודו Markovitz On BEST ANSWER

"the first approximate aggregation provided by Elasticsearch is the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

For the OP

precision_threshold

"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.
...
Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

You might also want to take a look at "Support for precise cardinality aggregation #15876"

For the OP, 2

"I have tried several numbers..."

You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.

0
Pratik Patil On

Cardinality does not ensure accurate count even with 40000 precision_threshold. There is another way to get accurate distinct count of a field.

This article on "Accurate Distinct Count and Values from Elasticsearch" explains in detail the solution as well as it's accuracy over Cardinality.