elasticsearch: script access to single-metric sub-aggregations in significant_terms aggregation?

Question

Asked by Shadocko on 19 June 2015 at 14:45 (346 views)

The Significant Terms aggregation type in Elasticsearch allows a custom script to be run at the bucket level to score each bucket, through the script attribute.

Is there a way to access the value of a single-metric sub-aggregation of significant_terms from that script, or are sub-aggregations computed only after bucket scoring, for the buckets that were not eliminated?

Also, is it possible to sub-aggregate on background set items, rather than subset items?

I am trying to compute the Okapi BM25 score of all terms found in a text versus a collection of texts. The complete setup is a little bit more complicated but for the purpose of illustration, I'll simplify it and suppose that two types of documents are stored in the index: words and documents.

Example document:

{
  _id: "somecollection/somedocument",
  collection: "somecollection",
  text: "this is a rather short text for the purpose of illustration"
}

Example word:

{
  _id: "somecollection/somedocument:1",
  value: "this",
  collection: "somecollection",
  document: "somecollection/somedocument",
  index: 1
}

Let's say I want to score the terms found in somecollection/somedocument. I can query for words in the document and then aggregate based on their value attribute:

GET myindex/word/_search
{
  query: {
    filtered: {
      filter: {
        term: {
          document: "somecollection/somedocument"
        }
      }
    }
  },
  size: 0,
  aggs: {
    bm25: {
      significant_terms: {
        field: "value",
        background_filter: {
          term: { collection : "somecollection" }
        },
        script: "???"
      }
    }
  }
}

In the script, for each term, _subset_freq provides the term frequency in the document (because here ES "documents" = single words), _subset_size provides the length of the document, _superset_freq provides the term frequency in the collection and _superset_size provides the total number of words in the collection.

However, BM25 scoring also requires the number of documents in the collection that contain the word (i.e. the cardinality of the document field for words in the superset matching the bucket).

Another approach is to initially query for documents. Also, let's do it for every document in the collection at once, because that is what I'm really after:
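To make the missing inputs concrete, here is a minimal sketch of the per-term BM25 computation in plain Python. It is not Elasticsearch script code: the function and parameter names are hypothetical, the defaults k1 = 1.2 and b = 0.75 are the commonly used ones, and the IDF is assumed to be the Lucene-style variant with the "+1" inside the logarithm. In terms of the first query, tf would be _subset_freq and doc_len would be _subset_size, while n_docs and doc_freq (and therefore avg_doc_len) are the values the script cannot see.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term of one document with Okapi BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- number of words in the document
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Assumption: Lucene-style IDF, which stays non-negative for common terms.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation with document-length normalisation.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

With the word-based query, avg_doc_len could be derived as _superset_size / n_docs if the number of documents were available, which again comes back to needing a cardinality over the document field.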

GET myindex/document/_search
{
  query: {
    filtered: {
      filter: {
        term: {
          collection: "somecollection"
        }
      }
    }
  },
  size: 0,
  aggs: {
    documents: {
      terms: {
        field: "_id"
      },
      aggs: {
        bm25: {
          significant_terms: {
            field: "text",
            background_filter: {
              term: { collection : "somecollection" }
            },
            script: "???"
          }
        }
      }
    }
  }
}

Now, _subset_freq and _subset_size are both exactly 1, _superset_freq provides the number of documents in the collection containing the term and _superset_size provides the total number of documents in the collection. We're missing both the term frequency in the document and the total term frequency in the collection. So querying for documents really doesn't help.

Is there a way to do what I'm trying to do?

The only solution I can see for now is to pre-compute and store some extra statistics with every word, meaning such stats will get out of sync when documents are added, edited or removed. This also precludes working with dynamic collections that may be based on a search filter involving more than a single field.
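For illustration, the pre-computation fallback could look roughly like the following sketch, done client-side in Python before indexing. The function name, the returned keys and the lowercase whitespace tokenization are all assumptions made for the example; a real setup would have to tokenize the same way the text field is analyzed.

```python
from collections import Counter


def collection_stats(documents):
    """Compute collection-level statistics for BM25.

    documents -- dict mapping a document id to its text
    """
    term_freq = Counter()  # total occurrences of each term in the collection
    doc_freq = Counter()   # number of documents containing each term
    total_words = 0
    for text in documents.values():
        # Assumption: naive lowercase whitespace tokenization.
        tokens = text.lower().split()
        total_words += len(tokens)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        "n_docs": len(documents),
        "total_words": total_words,
        "term_freq": term_freq,
        "doc_freq": doc_freq,
    }
```

These numbers are exactly what goes stale whenever a document in the collection changes, which is the drawback described above.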


There are 0 answers
