Elasticsearch terms query on 32-bit floats behaving oddly


With Elasticsearch v6.8.5 and v7.11.0, I'm having trouble wrapping my head around the behavior of floats being cast to doubles and losing precision, thus breaking my term queries.

For tech debt reasons, I've got an index mapping of type float:

PUT test
{
  "mappings": {
    "properties": {
      "cid": {
        "type" : "float",
        "ignore_malformed": false,
        "coerce" : false
      }
    }
  }
}

After indexing two documents containing the cids 2219658785 and 2219658651:

POST test/_doc
{
  "cid": 2219658785
}

POST test/_doc
{
  "cid": 2219658651
}

and querying for 2219658785:

GET test/_search
{
  "query": {
    "term": {
      "cid": {
        "value": 2219658785
      }
    }
  },
  "aggs": {
    "uniqueByCid": {
      "cardinality": {
        "field": "cid"
      }
    }
  }
}
  • both documents are returned
  • and the cardinality of the cid is 1.

Very odd.

If I retain the mapping and index much smaller cids, e.g. 1 and 2, the term query does work as expected – only 1 document is returned.

So, I figure that my large cids don't fit into a float and are cast to doubles because

GET test/_search
{
  "query": {
    "script": {
      "script": "Debug.explain(doc['cid']);"
    }
  }
}

prints out ScriptDocValues.Doubles.

To inspect further, I run a script utilizing DecimalFormat:

GET test/_search
{
  "query": {
    "script": {
      "script": """
          DecimalFormat df = new DecimalFormat("#");
          
          def val = doc['cid'].value;
          
          Debug.explain([val, df.format(val)]);
      """
    }
  }
}

and I see:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "painless_class" : "java.util.ArrayList",
        "to_string" : "[2.219658752E9, 2219658752]",
                                       ^^^^^^^^^^

Assuming the document from above contains cid: 2219658785, Elasticsearch has cast 2219658785 to 2219658752. But for cid: 2219658651, the script also prints out 2219658752!

Apparently, the "casted" floats (or rather longs?) seem to be capped at 2219658752.

So my questions are:

  1. What's so special about 2219658752? I understood 32-Bit floats to be capped at (2-2^-23) × 2^127 which is significantly higher than 2219658785. Is it not?
  2. Can I target my cids with a term query or do I have to reindex with longs or doubles?

Answer by imotov (best answer):

What's so special about 2219658752? I understood 32-Bit floats to be capped at (2-2^-23) × 2^127 which is significantly higher than 2219658785. Is it not?

The important part of the table you linked is not the maximum value but the number of significant bits/digits. Take a look at how a float is structured: a 32-bit float has a 24-bit significand (23 bits stored explicitly plus one implicit bit). Once a number no longer fits into those significant bits, you start losing precision. So while a float can represent numbers significantly higher than yours, it can only represent every integer exactly up to 16,777,216 (2^24). From 16,777,217 onward, several numbers are mapped to a single value.
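To see this outside of Elasticsearch, here is a minimal sketch in Python that round-trips integers through a 32-bit float using the standard `struct` module. The helper name `to_float32` is made up for illustration, and this only mimics the rounding of the float type itself, not Elasticsearch's actual indexing code.

```python
import struct

def to_float32(n: int) -> int:
    """Round n to the nearest 32-bit float and return that value as an int."""
    packed = struct.pack("<f", float(n))      # store as IEEE 754 binary32
    return int(struct.unpack("<f", packed)[0])

# Every integer up to 2**24 survives the round trip unchanged.
print(to_float32(16777216))    # 16777216
# Beyond 2**24 only every second integer is representable.
print(to_float32(16777217))    # 16777216
print(to_float32(16777218))    # 16777218
# Both cids from the question collapse to the same stored value.
print(to_float32(2219658785))  # 2219658752
print(to_float32(2219658651))  # 2219658752
```

The last two lines match what the Painless script in the question printed for both documents.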

So, to answer your question: there is nothing special about 2,219,658,785. It just cannot fit into a float without losing precision. In this case you are losing a bit less than the last 3 decimal digits. So basically all numbers between 2,219,658,624 and 2,219,658,880 are represented as the same value in Elasticsearch, and Elasticsearch will not be able to see the difference between these numbers.
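The range above can be derived from the spacing between neighbouring floats. The sketch below, again in plain Python, assumes standard IEEE 754 binary32 rounding (round to nearest, ties to even); the names `to_float32`, `spacing`, `low` and `high` are illustrative only.

```python
import struct

def to_float32(n: int) -> int:
    """Round n to the nearest 32-bit float and return that value as an int."""
    return int(struct.unpack("<f", struct.pack("<f", float(n)))[0])

cid = 2219658785
exponent = cid.bit_length() - 1     # 31, because 2**31 <= cid < 2**32
spacing = 2 ** (exponent - 23)      # gap between adjacent floats in this range
stored = to_float32(cid)

# Integers within half a gap of the stored value round to it. The two
# endpoints are exact ties, which round to this value under ties-to-even.
low, high = stored - spacing // 2, stored + spacing // 2

print(spacing)      # 256
print(stored)       # 2219658752
print(low, high)    # 2219658624 2219658880
print(all(to_float32(n) == stored for n in range(low, high + 1)))  # True
```

A gap of 256 is why a bit less than the last three decimal digits is lost: any two cids that fall into the same window are indistinguishable to a term query on a float field.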