I have loaded data into my elasticsearch
cluster from hive using the elasticsearch-hadoop
plugin from elastic
.
I need to fetch a count of unique account numbers. I have the following queries written in both hql
and queryDSL
, BUT they are returning different counts.
Hive Query:
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch the query looks like this:
{
"query": {
"bool": {
"must": [
{"match": { "capacity": "550"}}
]
}
},
"aggs": {
"unique_account": {
"cardinality": {
"field": "account"
}
}
}
}
// Returns --> 71607
Am I doing something wrong? What can I do to match the two queries?
Note:
There are exactly the same number of records in hive and elasticsearch.
For the OP
precision_threshold
You might also want to take a look at "Support for precise cardinality aggregation #15876"
For the OP, 2
You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.