With Elasticsearch v6.8.5 and v7.11.0, I'm having trouble wrapping my head around the behavior of floats being cast to doubles and losing precision, thus breaking my term queries.
For tech debt reasons, I've got an index mapping of type float:
PUT test
{
"mappings": {
"properties": {
"cid": {
"type" : "float",
"ignore_malformed": false,
"coerce" : false
}
}
}
}
After indexing two documents containing the cids 2219658785 and 2219658651:
POST test/_doc
{
"cid": 2219658785
}
POST test/_doc
{
"cid": 2219658651
}
and querying for 2219658785:
GET test/_search
{
"query": {
"term": {
"cid": {
"value": 2219658785
}
}
},
"aggs": {
"uniqueByCid": {
"cardinality": {
"field": "cid"
}
}
}
}
- both documents are returned
- and the cardinality of the cid is 1.
Very odd.
If I retain the mapping and index much smaller cids, e.g. 1 and 2, the term query does work as expected – only 1 document is returned.
So, I figure that my large cids don't fit into float and are cast to doubles, because
GET test/_search
{
"query": {
"script": {
"script": "Debug.explain(doc['cid']);"
}
}
}
prints out ScriptDocValues.Doubles.
To inspect further, I run a script utilizing DecimalFormat:
GET test/_search
{
"query": {
"script": {
"script": """
DecimalFormat df = new DecimalFormat("#");
def val = doc['cid'].value;
Debug.explain([val, df.format(val)]);
"""
}
}
}
and I see:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"painless_class" : "java.util.ArrayList",
"to_string" : "[2.219658752E9, 2219658752]",
^^^^^^^^^^
Assuming the document from above contains cid: 2219658785, Elasticsearch has cast 2219658785 to 2219658752.
But for cid: 2219658651, the script also prints out 2219658752!
Apparently, the cast floats (or rather longs?) seem to be capped at 2219658752.
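For what it's worth, the same collapse is reproducible in plain Java (a sketch for illustration only; Painless follows the same IEEE-754 rules, but this is not Elasticsearch code):

public class FloatCollapse {
    public static void main(String[] args) {
        // both cids round to the same 32-bit float when indexed into the float field
        float a = (float) 2219658785L;
        float b = (float) 2219658651L;
        System.out.println(a == b);     // true: both documents share one indexed value

        // widening the float back to a double gives the value the script printed
        System.out.println((double) a); // 2.219658752E9, matching ScriptDocValues.Doubles
    }
}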
So my questions are:
- What's so special about 2219658752? I understood 32-bit floats to be capped at (2-2^-23) × 2^127, which is significantly higher than 2219658785. Is it not?
- Can I target my cids with a term query or do I have to reindex with longs or doubles?
The important part in the table that you linked is not the max value but the number of significant bits/digits. Take a look at how a float is structured: once your number cannot fit into the significant bits, you start losing precision. So while a float can represent numbers significantly higher than yours, it starts losing precision once you pass 16,777,216. Above 16,777,217, several numbers will be mapped onto a single value.
So, to answer your question: there is nothing special about 2,219,658,785. It just cannot fit into a float without losing precision. In this case you are losing a bit less than the last 3 decimal digits. So basically all numbers between 2,219,658,624 and 2,219,658,880 are represented as the same value in Elasticsearch, and Elasticsearch will not be able to tell the difference between these numbers.