I'm experiencing strange behavior when I try to write a ranking expression in Vespa that compares strings that hold double-ish values (e.g. the string "11.10"). I've constructed a minimal example below that I believe demonstrates the behavior.
Vespa app directory structure:
❯ tree
.
├── example
│   ├── schema
│   │   └── documents.sd
│   └── services.xml
└── minimal_feed.json
My documents:
❯ cat minimal_feed.json
{"put": "id:namespace:documents::doc1", "fields": {"field1": "some text", "field2": "value1", "field3": "1.0"}}
{"put": "id:namespace:documents::doc2", "fields": {"field1": "other text", "field2": "value2", "field3": "2.0"}}
I will be querying on field1 and using field3 for relevance. I include field2 as a working counter-example to field3.
My schema:
❯ cat example/schema/documents.sd
schema documents {
    document documents {
        field field1 type string {
            indexing: summary | index
            match: text
        }
        field field2 type string {
            indexing: summary | attribute
            match: exact
        }
        field field3 type string {
            indexing: summary | attribute
            match: exact
        }
    }
    rank-profile custom_rank {
        first-phase {
            expression: if(attribute(field3) == query(field3_query), 1, 0)
        }
    }
}
My services.xml:
❯ cat example/services.xml
<services version="1.0">
    <container id="default" version="1.0">
        <search />
        <document-api />
    </container>
    <content id="content" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="documents" mode="index"/>
        </documents>
        <nodes count="1"/>
    </content>
</services>
Deploy, feed and query:
❯ vespa deploy --wait 300 example/
Waiting up to 5m0s for deploy API...
Uploading application package... done
❯ vespa feed -t http://localhost:8080 minimal_feed.json
{
"feeder.operation.count": 2,
"feeder.seconds": 1.203,
"feeder.ok.count": 2,
"feeder.ok.rate": 1.662,
"feeder.error.count": 0,
"feeder.inflight.count": 0,
"http.request.count": 2,
"http.request.bytes": 143,
"http.request.MBps": 0.000,
"http.exception.count": 0,
"http.response.count": 2,
"http.response.bytes": 184,
"http.response.MBps": 0.000,
"http.response.error.count": 0,
"http.response.latency.millis.min": 1200,
"http.response.latency.millis.avg": 1200,
"http.response.latency.millis.max": 1200,
"http.response.code.counts": {
"200": 2
}
}
❯ curl -X GET "http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20field1%20contains%20%27text%27;&ranking.profile=custom_rank&ranking.features.query(field3_query)=1.0" | jq .
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 2
},
"coverage": {
"coverage": 100,
"documents": 2,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:namespace:documents::doc1",
"relevance": 0,
"source": "content",
"fields": {
"sddocname": "documents",
"documentid": "id:namespace:documents::doc1",
"field1": "some text",
"field2": "value1",
"field3": "1.0"
}
},
{
"id": "id:namespace:documents::doc2",
"relevance": 0,
"source": "content",
"fields": {
"sddocname": "documents",
"documentid": "id:namespace:documents::doc2",
"field1": "other text",
"field2": "value2",
"field3": "2.0"
}
}
]
}
}
Note that the relevance is 0 for every hit. The comparison against field3 doesn't seem to match, even for doc1, whose field3 is "1.0".
If I switch the first-phase expression from using field3 to using field2 then it works. If I re-index field3 but prefix all the values with a character then it also works. This suggests that maybe something in Vespa is interpreting the string "1.0" as a double and the comparison is failing as a result?
Yes. Note that this is not related to query matching, but to string comparison in ranking, where strings are used as rank features. The ranking expression reads the attribute's string value from memory, computes a hash of it, and compares that hash with the query input. A string query input would be hashed as well, but because query(field3_query) has no declared type, the value 1.0 is parsed as a double instead, so the two sides do not compare equal.
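To make the mismatch concrete, here is a small Python sketch of the comparison described above. It is only a model of the behavior, not Vespa's implementation: the function names are hypothetical, and zlib.crc32 is a placeholder for whatever hash Vespa actually uses.

```python
import zlib


def string_hash(s: str) -> int:
    # Placeholder hash (assumption): Vespa uses its own hash function,
    # crc32 is used here only to get a deterministic number from a string.
    return zlib.crc32(s.encode("utf-8"))


def attribute_value(stored: str) -> float:
    # A string attribute is always represented by the hash of its value.
    return float(string_hash(stored))


def untyped_query_value(raw: str) -> float:
    # An untyped query feature that looks like a number is parsed as a number;
    # anything else falls back to being hashed as a string.
    try:
        return float(raw)
    except ValueError:
        return float(string_hash(raw))


def first_phase(stored: str, query_input: str) -> int:
    # Mirrors: if(attribute(field3) == query(field3_query), 1, 0)
    return 1 if attribute_value(stored) == untyped_query_value(query_input) else 0


if __name__ == "__main__":
    print(first_phase("1.0", "1.0"))        # hash("1.0") vs. the double 1.0
    print(first_phase("v1.0", "v1.0"))      # hash vs. hash of the same string
    print(first_phase("value1", "value1"))  # hash vs. hash of the same string
```

In this model the "1.0" case compares a hash against the number 1.0, while the prefixed and non-numeric cases compare two hashes of the same string, which is consistent with field2 and the prefixed field3 values working in the question.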
This is not the same as producing a query match against the field. That can be done with YQL (field3 contains "!"), combined with matches or other text scoring features in the ranking expression, instead of attribute(x) == query(undefined_input_string_feature_which_might_be_treated_as_a_number).
In Vespa it's generally better to use tensors for query inputs used in ranking. Tensors are typed, so you have to define the query input tensors used in the ranking expression, which avoids issues like this one.