distinct count on hive does not match cardinality count on elasticsearch

Question

809 views Asked by summerNight At 15 September 2017 at 20:12

I have loaded data from Hive into my Elasticsearch cluster using the elasticsearch-hadoop plugin from Elastic.

I need to fetch a count of unique account numbers. I have written the following queries in both HQL and Query DSL, but they return different counts.

Hive Query:

select count(distinct account) from <tableName> where capacity="550";

// Returns --> 71132

Similarly, in Elasticsearch the query looks like this:

{
  "query": {
    "bool": {
      "must": [
        {"match": {"capacity": "550"}}
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}

// Returns --> 71607

Am I doing something wrong? What can I do to make the two queries return the same count?

Note: Hive and Elasticsearch contain exactly the same number of records.

There are 2 answers

Pratik Patil On 20 January 2021 at 18:25

The cardinality aggregation does not guarantee an exact count, even with precision_threshold set to 40000. There is another way to get an exact distinct count of a field.

The article "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, as well as its accuracy compared with the cardinality aggregation.
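
The article's own method is not reproduced here. As a rough sketch of one way to get an exact count (not necessarily the article's), the distinct values can be paged through with a composite aggregation and the buckets counted. The function name `exact_distinct_count`, the aggregation names `distinct` and `value`, and the injected `search` callable are assumptions for illustration; `search` is expected to send the request body to the index's `_search` endpoint and return the parsed JSON response.

```python
from typing import Any, Callable, Dict, Optional


def exact_distinct_count(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    field: str = "account",
    capacity: str = "550",
    page_size: int = 1000,
) -> int:
    """Count distinct values of `field` by paging a composite aggregation."""
    total = 0
    after: Optional[Dict[str, Any]] = None
    while True:
        composite: Dict[str, Any] = {
            "size": page_size,
            # One bucket per distinct value of the field.
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume after the last bucket of the previous page.
            composite["after"] = after
        body = {
            "size": 0,  # no hits needed, only the aggregation
            "query": {"bool": {"must": [{"match": {"capacity": capacity}}]}},
            "aggs": {"distinct": {"composite": composite}},
        }
        agg = search(body)["aggregations"]["distinct"]
        total += len(agg["buckets"])
        after = agg.get("after_key")
        # Stop when a page comes back empty or without a continuation key.
        if after is None or not agg["buckets"]:
            return total
```

This trades speed for exactness: every distinct value is returned in some page, so roughly 71,000 values at a page size of 1,000 means on the order of 70 requests.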

**David דודו Markovitz** · Accepted Answer · 2017-09-15T20:41:08+00:00

"the first approximate aggregation provided by Elasticsearch is the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

For the OP

precision_threshold

"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.
...
Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
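
For reference, a minimal sketch of the question's query with precision_threshold set explicitly. The helper name `cardinality_query` is an assumption for illustration; the printed JSON is the request body that would be sent to `_search`.

```python
import json


def cardinality_query(capacity="550", field="account", precision_threshold=40000):
    """Build the question's query with an explicit precision_threshold."""
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {"bool": {"must": [{"match": {"capacity": capacity}}]}},
        "aggs": {
            "unique_account": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


print(json.dumps(cardinality_query(), indent=2))
```

Per the quoted documentation, this only makes the result near-exact while the true cardinality stays under the threshold, so on its own it should not be expected to reproduce the Hive count here.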

You might also want to take a look at "Support for precise cardinality aggregation #15876"

For the OP, 2

"I have tried several numbers..."

You have 71,132 distinct values, while the precision threshold is capped at 40,000. The cardinality is therefore over the threshold, which means accuracy is traded for memory savings.
This is how the chosen implementation (based on the HyperLogLog++ algorithm) works.
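
To put the discrepancy in perspective, here is a quick calculation using the two counts reported in the question (the variable names are only illustrative):

```python
hive_count = 71132  # exact count from Hive's count(distinct account)
es_count = 71607    # approximate count from the cardinality aggregation

abs_error = es_count - hive_count
rel_error = abs_error / hive_count

print(f"absolute error: {abs_error}")      # absolute error: 475
print(f"relative error: {rel_error:.2%}")  # relative error: 0.67%
```

The Elasticsearch figure is off by well under 1%, which is the kind of small error an approximate algorithm is expected to introduce rather than a sign of missing or duplicated records.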
