
search with filter by token count


Fields in documents are analyzed to create tokens.

  1. {"message":"hello world"} -> token: ["hello", "world"]
  2. {"message":"hello"} -> token: ["hello"]
  3. {"message":"world"} -> token: ["world"]
  4. {"message":"hello java"} -> token: ["hello", "java"]
  5. {"message":"java"} -> token: ["java"]

Is it possible to search for all documents in which a specific field contains a given token plus one or more other tokens?

  • Result for the given example for token "hello" would be:
    • 1,4
  • For "world":
    • 1
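To make the expected results concrete, here is a small local sketch that does not touch Elasticsearch at all. It assumes whitespace splitting and lowercasing as a rough stand-in for the analyzer, and the function name match_with_min_tokens is made up for illustration. It prints the numbers of the sample documents that contain the given token and have more than one token.

```shell
#!/bin/sh
# Local illustration only: emulates "contains the token AND has more than one token"
# on the five sample messages. Whitespace splitting is an assumed stand-in for
# the real analysis that Elasticsearch performs.
match_with_min_tokens() {
  token="$1"
  awk -v t="$token" '
    {
      found = 0
      # Look for the requested token among the whitespace-separated words.
      for (i = 1; i <= NF; i++) {
        if (tolower($i) == tolower(t)) found = 1
      }
      # Keep the document only if it has the token and at least one other word.
      if (found && NF > 1) print NR
    }' <<'EOF'
hello world
hello
world
hello java
java
EOF
}
```

Calling match_with_min_tokens hello prints 1 and 4, and match_with_min_tokens world prints 1, which matches the results listed above.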

As described in the termvectors documentation, one can access the tokens and statistics about them. However, this only works for individual documents, not as a search filter in a query or aggregation.
Would be nice if someone could help.


There is 1 answer

Val (BEST ANSWER)

Yes, you can use the token_count type for this. For instance, in your mapping, you can define message as a multi-field that contains the message itself (e.g. "hello", "hello world", etc.) and also the number of tokens in the message. You can then include constraints on the token count in your queries.

So your mapping for message should look like this, where message is a normal analyzed string and word_count is a sub-field that holds the number of tokens:

curl -XPUT localhost:9200/tests -d '
{
  "mappings": {
    "test": {
      "properties": {
        "message": {
          "type": "string",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}'
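
The mapping above uses the syntax of older Elasticsearch releases (a string type and a test mapping type). As a hedged sketch, assuming Elasticsearch 7.x or later, the same idea would use the text type, drop the mapping type level, and send an explicit Content-Type header. The ES_URL variable and the create_index helper are names invented for this example.

```shell
#!/bin/sh
# Assumption: an Elasticsearch 7.x or later node is reachable at this address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Typeless mapping: "text" takes the place of the old analyzed "string" type,
# and word_count is a token_count sub-field analyzed with the standard analyzer.
MAPPING='{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "word_count": {
            "type": "token_count",
            "analyzer": "standard",
            "store": true
          }
        }
      }
    }
  }
}'

# Hypothetical helper: creates the "tests" index with the mapping above.
# It is only defined here; call it when a cluster is actually available.
create_index() {
  curl -s -XPUT "$ES_URL/tests" \
    -H 'Content-Type: application/json' \
    -d "$MAPPING"
}
```

Keeping the request body in a variable makes it easy to inspect or validate the JSON before sending it.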

Then you can query for all documents that have hello in the message, restricted to those whose message has more than one token. With the following query, you'll only get hello java and hello world, but not hello:

curl -XPOST localhost:9200/tests/test/_search -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "message": "hello"
          }
        },
        {
          "range": {
            "message.word_count": {
              "gt": 1
            }
          }
        }
      ]
    }
  }
}'

Similarly, if you replace hello with world in the above query, you'll only get hello world.
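
Since only the search term changes between these two queries, the request body can be generated from a parameter. The following is a minimal sketch, assuming Elasticsearch 7.x or later (typeless _search endpoint, explicit Content-Type header); build_query, search_multi_token and ES_URL are hypothetical names. It also moves the range clause into the bool filter section, a design choice that fits here because the token count constraint does not need to contribute to scoring.

```shell
#!/bin/sh
# Assumption: an Elasticsearch 7.x or later node is reachable at this address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: prints the query body for a given token.
# Note: the token is inserted as-is, so it must not contain quotes or backslashes.
build_query() {
  printf '{"query":{"bool":{"must":[{"match":{"message":"%s"}}],"filter":[{"range":{"message.word_count":{"gt":1}}}]}}}' "$1"
}

# Hypothetical helper: runs the search against the "tests" index.
# It is only defined here; call it when a cluster is actually available.
search_multi_token() {
  curl -s -XPOST "$ES_URL/tests/_search" \
    -H 'Content-Type: application/json' \
    -d "$(build_query "$1")"
}
```

For example, search_multi_token world would send the same query as above with world as the search term.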