The Significant Terms aggregation type in Elasticsearch allows a custom script to be run at the bucket level to score each bucket, through the script attribute.
Is there a way to access, from within that script, the value of a single-metric sub-aggregation of significant_terms, or are sub-aggregations computed only after bucket scoring, and only for the buckets that were not eliminated?
Also, is it possible to sub-aggregate on background set items, rather than subset items?
I am trying to compute the Okapi BM25 score of all terms found in a text against a collection of texts. The complete setup is a little more complicated, but for the purpose of illustration I'll simplify it and suppose that two types of documents are stored in the index: words and documents.
Example document:
{
  _id: "somecollection/somedocument",
  collection: "somecollection",
  text: "this is a rather short text for the purpose of illustration"
}
Example word:
{
  _id: "somecollection/somedocument:1",
  value: "this",
  collection: "somecollection",
  document: "somecollection/somedocument",
  index: 1
}
Let's say I want to score the terms found in somecollection/somedocument.
I can query for words in the document and then aggregate based on their value attribute:
GET myindex/word/_search
{
  query: {
    filtered: {
      filter: {
        term: {
          document: "somecollection/somedocument"
        }
      }
    }
  },
  size: 0,
  aggs: {
    bm25: {
      significant_terms: {
        field: "value",
        background_filter: {
          term: { collection : "somecollection" }
        },
        script: "???"
      }
    }
  }
}
In the script, for each term, _subset_freq provides the term frequency in the document (because here ES "documents" = single words), _subset_size provides the length of the document, _superset_freq provides the term frequency in the collection and _superset_size provides the total number of words in the collection.
However, BM25 scoring also requires the number of documents in the collection that contain the word (i.e. the cardinality of the document field for words in the superset matching the bucket).
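To make the missing piece concrete, here is a minimal sketch of the per-term score I am after, written in Python rather than as an ES script. The function name and parameter names are hypothetical, the defaults k1 = 1.2 and b = 0.75 are just commonly used values, and the smoothed IDF shown is only one of several variants in use. The docstring notes which inputs map to the script variables of the first query, as far as I understand them.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one term to one document (illustrative sketch).

    tf          : term frequency in the document (_subset_freq in the first query)
    doc_len     : document length in words (_subset_size in the first query)
    avg_doc_len : average document length, i.e. _superset_size / n_docs
    n_docs      : number of documents in the collection (not given by the first query)
    doc_freq    : number of documents containing the term (the missing statistic)
    """
    # Smoothed IDF: the "1 +" keeps the value non-negative for very common terms.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: documents longer than average are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

Everything except n_docs and doc_freq can be derived from the four values the script already receives, which is why access to a cardinality sub-aggregation would be enough.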
Another approach is to query for documents instead. Let's also do it for every document in the collection at once, because that is what I'm really after:
GET myindex/document/_search
{
  query: {
    filtered: {
      filter: {
        term: {
          collection: "somecollection"
        }
      }
    }
  },
  size: 0,
  aggs: {
    documents: {
      terms: {
        field: "_id"
      },
      aggs: {
        bm25: {
          significant_terms: {
            field: "text",
            background_filter: {
              term: { collection : "somecollection" }
            },
            script: "???"
          }
        }
      }
    }
  }
}
Now, _subset_freq and _subset_size are both exactly 1, _superset_freq provides the number of documents in the collection containing the term and _superset_size provides the total number of documents in the collection. We're missing both the term frequency in the document and the total term frequency in the collection. So querying for documents really doesn't help.
Is there a way to do what I'm trying to do?
The only solution I can see for now is to pre-compute and store some extra statistics with every word, which means those stats will get out of sync whenever documents are added, edited or removed. It also rules out working with dynamic collections, which may be defined by a search filter involving more than a single field.
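For illustration, a minimal sketch of that pre-computation, assuming the word records are shaped like the example word above and are available in memory. The function name is hypothetical, and "somecollection/otherdocument" is a made-up second document added only so that the counts differ.

```python
from collections import defaultdict


def document_frequencies(words):
    """Count, per (collection, value), how many distinct documents contain the value."""
    docs_by_term = defaultdict(set)
    for word in words:
        # A set ignores repeated occurrences of a word within the same document.
        docs_by_term[(word["collection"], word["value"])].add(word["document"])
    return {term: len(docs) for term, docs in docs_by_term.items()}


# Hypothetical sample data in the shape of the example word record.
words = [
    {"value": "this", "collection": "somecollection",
     "document": "somecollection/somedocument", "index": 1},
    {"value": "this", "collection": "somecollection",
     "document": "somecollection/somedocument", "index": 5},
    {"value": "this", "collection": "somecollection",
     "document": "somecollection/otherdocument", "index": 1},
    {"value": "text", "collection": "somecollection",
     "document": "somecollection/somedocument", "index": 6},
]

print(document_frequencies(words))
```

The result is keyed on a fixed collection field, which is exactly the limitation: the counts would have to be recomputed for every change and for every ad hoc filter that defines a collection.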