String comparison in Vespa ranking expression behaving oddly


I'm experiencing strange behavior when I try to write a ranking expression in Vespa that compares strings that hold double-ish values (e.g. the string "11.10"). I've constructed a minimal example below that I believe demonstrates the behavior.

Vespa app directory structure:

 ❯ tree
.
├── example
│   ├── schema
│   │   └── documents.sd
│   └── services.xml
└── minimal_feed.json

My documents:

❯ cat minimal_feed.json
{"put": "id:namespace:documents::doc1", "fields": {"field1": "some text", "field2": "value1", "field3": "1.0"}}
{"put": "id:namespace:documents::doc2", "fields": {"field1": "other text", "field2": "value2", "field3": "2.0"}}

I will be querying on field1 and using field3 for relevance. I include field2 as a working counter-example to field3.

My schema:

❯ cat example/schema/documents.sd
schema documents {
    document documents {
        field field1 type string {
            indexing: summary | index
            match: text
        }
        field field2 type string {
            indexing: summary | attribute
            match: exact
        }
        field field3 type string {
            indexing: summary | attribute
            match: exact
        }
    }

    rank-profile custom_rank {
        first-phase {
            expression: if(attribute(field3) == query(field3_query), 1, 0)
        }
    }
}

My services.xml:

❯ cat example/services.xml
<services version="1.0">
    <container id="default" version="1.0">
        <search />
        <document-api />
    </container>

    <content id="content" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="documents" mode="index"/>
        </documents>
        <nodes count="1"/>
    </content>
</services>

Deploy, feed and query:

❯ vespa deploy --wait 300 example/
Waiting up to 5m0s for deploy API...
Uploading application package... done

❯ vespa feed -t http://localhost:8080 minimal_feed.json
{
  "feeder.operation.count": 2,
  "feeder.seconds": 1.203,
  "feeder.ok.count": 2,
  "feeder.ok.rate": 1.662,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 2,
  "http.request.bytes": 143,
  "http.request.MBps": 0.000,
  "http.exception.count": 0,
  "http.response.count": 2,
  "http.response.bytes": 184,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 1200,
  "http.response.latency.millis.avg": 1200,
  "http.response.latency.millis.max": 1200,
  "http.response.code.counts": {
    "200": 2
  }
}

❯ curl -X GET "http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20field1%20contains%20%27text%27;&ranking.profile=custom_rank&ranking.features.query(field3_query)=1.0" | jq .

{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 2
    },
    "coverage": {
      "coverage": 100,
      "documents": 2,
      "full": true,
      "nodes": 1,
      "results": 1,
      "resultsFull": 1
    },
    "children": [
      {
        "id": "id:namespace:documents::doc1",
        "relevance": 0,
        "source": "content",
        "fields": {
          "sddocname": "documents",
          "documentid": "id:namespace:documents::doc1",
          "field1": "some text",
          "field2": "value1",
          "field3": "1.0"
        }
      },
      {
        "id": "id:namespace:documents::doc2",
        "relevance": 0,
        "source": "content",
        "fields": {
          "sddocname": "documents",
          "documentid": "id:namespace:documents::doc2",
          "field1": "other text",
          "field2": "value2",
          "field3": "2.0"
        }
      }
    ]
  }
}

Note that the relevance of every hit is 0, even though doc1 has field3 set to "1.0". The expression doesn't seem to be matching the field.

If I switch the first-phase expression from using field3 to using field2 then it works. If I re-index field3 but prefix all the values with a character then it also works. This suggests that maybe something in Vespa is interpreting the string "1.0" as a double and the comparison is failing as a result?

Answer by Jo Kristian Bergum:

> This suggests that maybe something in Vespa is interpreting the string "1.0" as a double and the comparison is failing as a result?

Yes. Note that this is not related to query matching but to how string comparisons work in ranking expressions when strings are used as ranking features. The attribute(field3) feature reads the string attribute value from memory and calculates a hash of it. That hash is then compared with the value of query(field3_query). Because the type of this query input is not declared, a non-numeric string would be hashed as well, but "1.0" is parsed to a double, so the two sides never compare equal.
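The mismatch can be illustrated with a small Python sketch. This is not Vespa's implementation: hash_to_double, attribute_feature, query_feature and first_phase are hypothetical names, and the hash function is an arbitrary stand-in. It assumes only the behavior described above, namely that the attribute side is always hashed while an undeclared query input is parsed as a number whenever it looks like one.

```python
import hashlib


def hash_to_double(text: str) -> float:
    """Hypothetical stand-in for hashing a string into a double."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep 52 bits so the integer is exactly representable as a float.
    bits = int.from_bytes(digest[:8], "big") >> 12
    # Force the top bit so a hash is never a small number; this only
    # keeps the demo deterministic and is not a claim about Vespa.
    return float(bits | (1 << 51))


def attribute_feature(value: str) -> float:
    # Assumption: a string attribute is always hashed, never parsed.
    return hash_to_double(value)


def query_feature(raw: str) -> float:
    # Assumption: an undeclared query input is treated as a number
    # when it parses as one, and hashed otherwise.
    try:
        return float(raw)
    except ValueError:
        return hash_to_double(raw)


def first_phase(attr: str, query: str) -> int:
    # Mirrors: if(attribute(x) == query(y), 1, 0)
    return 1 if attribute_feature(attr) == query_feature(query) else 0


print(first_phase("value1", "value1"))  # 1: both sides hashed
print(first_phase("1.0", "1.0"))        # 0: hash vs. the number 1.0
print(first_phase("x1.0", "x1.0"))      # 1: the prefix prevents number parsing
```

Under these assumptions the sketch reproduces all three observations from the question: field2 matches, field3 does not, and prefixing the field3 values with a character makes them match again.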

This is not the same as producing a query match against the field. Matching can be done in YQL (field3 contains "!"), and the ranking expression can then use matches or other text scoring features instead of attribute(x) == query(undefined_input_string_feature_which_might_be_treated_as_a_number).

It's generally better to use tensors for query inputs used in ranking. Tensors in Vespa are typed, so you need to define the query input tensors used in the ranking expression, which avoids issues like the one here.