Elasticsearch terms query on 32-bit floats behaving oddly

Question

Asked by Joe - Check out my books on 19 October 2023 at 19:20 (77 views)

With Elasticsearch v6.8.5 and v7.11.0, I'm having trouble wrapping my head around the behavior of floats being cast to doubles and losing precision, thus breaking my term queries.

For tech debt reasons, I've got an index mapping of type float:

PUT test
{
  "mappings": {
    "properties": {
      "cid": {
        "type" : "float",
        "ignore_malformed": false,
        "coerce" : false
      }
    }
  }
}

After indexing two documents containing the cids 2219658785 and 2219658651:

POST test/_doc
{
  "cid": 2219658785
}

POST test/_doc
{
  "cid": 2219658651
}

and querying for 2219658785:

GET test/_search
{
  "query": {
    "term": {
      "cid": {
        "value": 2219658785
      }
    }
  },
  "aggs": {
    "uniqueByCid": {
      "cardinality": {
        "field": "cid"
      }
    }
  }
}

both documents are returned and the cardinality of cid is 1.

Very odd.

If I retain the mapping and index much smaller cids, e.g. 1 and 2, the term query does work as expected – only 1 document is returned.

So, I figure that my large cids don't fit into a float and are cast to doubles because

GET test/_search
{
  "query": {
    "script": {
      "script": "Debug.explain(doc['cid']);"
    }
  }
}

prints out ScriptDocValues.Doubles.

To inspect further, I run a script utilizing DecimalFormat:

GET test/_search
{
  "query": {
    "script": {
      "script": """
          DecimalFormat df = new DecimalFormat("#");
          
          def val = doc['cid'].value;
          
          Debug.explain([val, df.format(val)]);
      """
    }
  }
}

and I see:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "painless_class" : "java.util.ArrayList",
        "to_string" : "[2.219658752E9, 2219658752]",
                                       ^^^^^^^^^^

Assuming the document from above contains cid: 2219658785, Elasticsearch has cast 2219658785 to 2219658752. But for cid: 2219658651, the script also prints out 2219658752!

Apparently, the "casted" floats (or rather longs?) seem to be capped at 2219658752.

So my questions are:

1. What's so special about 2219658752? I understood 32-bit floats to be capped at (2-2^-23) × 2^127, which is significantly higher than 2219658785. Is it not?
2. Can I target my cids with a term query, or do I have to reindex with longs or doubles?

There is 1 answer

**imotov** · Accepted Answer · 2023-10-19T21:27:33+00:00

What's so special about 2219658752? I understood 32-Bit floats to be capped at (2-2^-23) × 2^127 which is significantly higher than 2219658785. Is it not?

The important part of the table that you linked is not the max value but the number of significant bits/digits. Take a look at how a float is structured: it has a 24-bit significand (23 stored bits plus an implicit leading one). Once your number cannot fit into the significant bits, you start losing precision. So while a float can represent numbers significantly higher than yours, it starts losing precision once you pass 16,777,216 (2^24). From 16,777,217 upward, several integers are mapped to a single float value.
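As a rough illustration of the point about significant bits, here is a small Python sketch that computes the gap between neighbouring 32-bit float values around a given positive integer. The helper name `float32_spacing` is made up for this example, and the sketch assumes a standard IEEE 754 single-precision layout with 24 significant bits.

```python
def float32_spacing(n: int) -> float:
    """Gap between adjacent float32 values in the power-of-two range containing n.

    Hypothetical helper for illustration; assumes n is a positive integer
    and that floats follow IEEE 754 single precision (24 significant bits).
    """
    exponent = n.bit_length() - 1   # floor(log2(n))
    return 2.0 ** (exponent - 23)   # 23 stored bits + 1 implicit bit


print(float32_spacing(16_777_215))     # 1.0: every integer still has its own float
print(float32_spacing(16_777_216))     # 2.0: only every second integer is representable
print(float32_spacing(2_219_658_785))  # 256.0: only every 256th integer is representable
```

Once the spacing exceeds 1, distinct integers have to share a float value, which is why small cids such as 1 and 2 behave as expected while the large ones do not.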

So, to answer your question, there is nothing special about 2,219,658,785. It just cannot fit into a float without losing precision. In this case you are losing a bit less than the last 3 decimal digits, because between 2^31 and 2^32 adjacent float values are 256 apart. So basically all numbers between 2,219,658,624 and 2,219,658,880 are represented as the same value (2,219,658,752) in Elasticsearch, and Elasticsearch will not be able to see the difference between these numbers.
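The collapse can be reproduced outside Elasticsearch. The following is a minimal sketch, assuming the field value is rounded to the nearest IEEE 754 single-precision float (round half to even) in the same way Python's `struct` module does when packing with the `f` format. The helper name `to_float32` is invented for this example.

```python
import struct


def to_float32(n: int) -> int:
    """Round-trip an integer through a 32-bit float and return the stored value.

    Hypothetical helper for illustration; it mimics storing n in a
    single-precision field, not Elasticsearch's actual code path.
    """
    packed = struct.pack(">f", n)              # round to nearest float32
    return int(struct.unpack(">f", packed)[0])


# Both cids from the question end up as the same stored value.
print(to_float32(2219658785))  # 2219658752
print(to_float32(2219658651))  # 2219658752

# The edges of the range quoted above also land on that value,
# while the integers just outside it round to the neighbouring floats.
print(to_float32(2219658624))  # 2219658752
print(to_float32(2219658880))  # 2219658752
print(to_float32(2219658623))  # 2219658496
print(to_float32(2219658881))  # 2219659008
```

Since both documents hold the same float, a term query for either cid matches both, and the cardinality aggregation sees a single distinct value.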
