I have a sample Vespa instance and I want to train a LightGBM model from the rank profile, following https://docs.vespa.ai/documentation/learning-to-rank.html
However, whenever I restrict recall to a specific docID, I get 0 hits. I'm using the example code from here: https://github.com/vespa-engine/sample-apps/blob/master/text-search/src/python/collect_training_data.py
body = create_request_top_hits("test", "training", hits=2)
get_features(url, body)
And this correctly returns:
[{'id': 'index:domains/0/944f3a850511f388fe97ac85',
'relevance': 1.2427330381582673,
'source': 'domains',
'fields': {'uri': '6202597992',
'rankfeatures': {'bm25(body)': 2.8145480372957787,
'nativeFieldMatch(categories)': 0.0,
'nativeFieldMatch(concepts)': 0.8591903630989031,
'nativeFieldMatch(links)': 0.0,
'nativeFieldMatch(title)': 0.0,
'nativeProximity(categories)': 0.0,
'nativeProximity(concepts)': 0.0,
'nativeProximity(links)': 0.0,
'nativeProximity(title)': 0.0,
'rankingExpression(time_ranking)': 1.0}}},
{'id': 'index:domains/0/93f92aae1d6a010c2111e9b7',
'relevance': 1.2010786365413106,
'source': 'domains',
'fields': {'uri': '6206270866',
'rankfeatures': {'bm25(body)': 2.0397289658724347,
'nativeFieldMatch(categories)': 0.0,
'nativeFieldMatch(concepts)': 0.8591903630989031,
'nativeFieldMatch(links)': 0.0,
'nativeFieldMatch(title)': 0.0,
'nativeProximity(categories)': 0.0,
'nativeProximity(concepts)': 0.0,
'nativeProximity(links)': 0.0,
'nativeProximity(title)': 0.0,
'rankingExpression(time_ranking)': 1.0}}}]
To see if recall works, we'll use the top result:
'id': 'index:domains/0/944f3a850511f388fe97ac85'
'uri': '6202597992' # docIDs are derived from the uri field
Then I set the recall to that docID:
doc_id = [6202597992, "6202597992", "944f3a850511f388fe97ac85"] # multiple representations...
body = create_request_specific_ids("test", "training", doc_id)
get_features(url, body)
I would expect this to return the same rank features as before, but instead I get 0 hits. This is the full response:
{'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 0}, 'coverage': {'coverage': 100, 'documents': 798, 'full': True, 'nodes': 5, 'results': 5, 'resultsFull': 5}}}
I've checked the docs and examples and haven't been able to find anything about this. Any insights would be greatly appreciated.
The collect script assumes that your document schema has a field called id, and it restricts recall using that field. Your documents appear to be identified by uri instead, so nothing matches. If you alter the script to use the uri field, you should be able to retrieve the documents.
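A minimal sketch of what that change could look like. The request body layout here follows my recollection of the sample script, so check it against your copy. The id_field parameter is a hypothetical addition of mine, not part of the original script, and this assumes uri is searchable (indexed or an attribute) in your schema.

```python
def create_request_specific_ids(query, rank_profile, doc_ids, id_field="uri"):
    """Build a query body whose recall is limited to the given documents.

    id_field is a hypothetical parameter: the name of the schema field
    that holds the document identifier (assumed to be "uri" here).
    """
    # One "<field>:<value>" term per document; str() handles int or str ids.
    recall_terms = " ".join(f"{id_field}:{str(doc_id)}" for doc_id in doc_ids)
    return {
        "yql": f"select {id_field}, rankfeatures from sources * "
               "where (userInput(@userQuery));",
        "userQuery": query,
        "hits": len(doc_ids),
        # The leading "+" makes the recall terms mandatory.
        "recall": f"+({recall_terms})",
        "timeout": "15s",
        "ranking": {"profile": rank_profile, "listFeatures": "true"},
    }


body = create_request_specific_ids("test", "training", ["6202597992"])
print(body["recall"])
```

You would then pass body to get_features(url, body) as before. Note that only the uri value belongs in doc_ids; the hash at the end of the index:domains/0/... hit id is not a field value, so it would not match.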