Elasticsearch: search_as_you_type field, how does it tokenize?


I am reading the official doc at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html and I do not understand how the search_as_you_type field works.

I have the following settings:

{
  "settings": {
    "analysis": {
      "tokenizer": {
         "ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "partial_words" : {
          "type": "custom",
          "tokenizer": "ngrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
   "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "fields": {
          "shingles": { 
            "type": "search_as_you_type",
            "analyzer": "partial_words",
            "term_vector": "with_positions_offsets"
          },
          "ngrams": {
            "type": "text",
            "analyzer": "partial_words",
            "search_analyzer": "standard",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
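
For reference, here is how I index a document with these settings (the index name my_index is just an example):

POST my_index/_doc
{
  "my_text": "Martin Luther was a german priest"
}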

I would like to know how my_text.shingles is tokenized. For instance, the text

"Martin Luther was a german priest" 

is analyzed at index time in the "my_text" field with the "partial_words" analyzer. How does it work in the shingles fields? Which tokens should I have in

1) my_text.shingles
2) my_text.shingles._2gram
3) my_text.shingles._3gram

Thanks for shedding some light on this!

EDIT: is there any way (or any query) to verify that the generated subfields produce the following tokens?

1) my_text.shingles
[Martin, Luther, was, a, german, priest]

2) my_text.shingles._2gram
[Martin Luther, Luther was, was a, a german, german priest]

3) my_text.shingles._3gram
[Martin Luther was, Luther was a, was a german, a german priest]

1 Answer

Answered by Musab Dogan

You can check this article to understand more. Simply put, a search_as_you_type field tokenizes the text with its configured analyzer, and the _2gram and _3gram subfields then build shingles (groups of adjacent tokens) of size 2 and 3 on top of that token stream.

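For intuition, you can reproduce the _2gram behavior with an ordinary shingle filter. This is a minimal sketch, not the literal internal definition: the index and analyzer names (shingle_demo, like_2gram, shingles_2) are made up for the demo, and the standard tokenizer stands in for the field's configured analyzer.

PUT shingle_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "shingles_2": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "like_2gram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "shingles_2"]
        }
      }
    }
  }
}

POST shingle_demo/_analyze
{
  "analyzer": "like_2gram",
  "text": "Martin Luther was a german priest"
}

This returns [martin luther, luther was, was a, a german, german priest], matching the expected _2gram list above (modulo lowercasing). Note, however, that your mapping sets the partial_words analyzer (a character ngram tokenizer) on the search_as_you_type field, so its base tokens are character n-grams rather than whole words, and the shingle subfields will combine those n-grams; the whole-word lists in the question assume a word-splitting analyzer such as standard.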

You can also use the _analyze API against your own index to see how the text is tokenized by the partial_words analyzer:

POST test_search_as_you_type2/_analyze
{
  "analyzer": "partial_words",
  "text": ["Martin Luther was a german priest"]
}
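
To see what each generated subfield produces (which is what the EDIT asks), you can pass a field instead of an analyzer; _analyze then resolves that field's index-time analyzer from the mapping. A sketch, assuming the mapping from the question lives in this same index:

POST test_search_as_you_type2/_analyze
{
  "field": "my_text.shingles._2gram",
  "text": ["Martin Luther was a german priest"]
}

POST test_search_as_you_type2/_analyze
{
  "field": "my_text.shingles._3gram",
  "text": ["Martin Luther was a german priest"]
}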