Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?

I've tried creating a multi-field with the english analyzer (below), but if I search in that field then I only get results that match parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.

"mappings" : {
   ...
      "text" : {
        "type" : "string",
        "fields" : {
          "english" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets_payloads",
            "analyzer" : "english"
          }
        }
      },
   ...

Answer by a paid nerd (best answer)

Figured it out!

  1. Store two fields, one for the text content (text) and a sub-field with the English-ified stem words (text.english).
  2. Create a custom analyzer based on the default English analyzer, but without the stop-word filter, so that stop words such as "it" and "is" stay in the index.
  3. Highlight both fields and check for each when displaying results to the user.
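
For step 3, here's a rough sketch of how the display side could pick highlights from a single hit. It assumes the usual search response shape, where each hit carries a highlight object mapping field names to arrays of fragments. The helper name pickHighlights and the choice to prefer text.english over text are my own assumptions for illustration, not something Elasticsearch provides:

```javascript
// Hypothetical helper, not part of Elasticsearch.
// `hit` is one element of response.hits.hits from a search that asked for
// highlighting on both 'text.english' and 'text'.
function pickHighlights(hit) {
  const highlight = hit.highlight || {};
  const stemmed = highlight['text.english'] || [];
  const exact = highlight['text'] || [];
  // Assumption: prefer the stemmed sub-field, since that is the field the
  // query searches; fall back to the plain field if it has no fragments.
  return stemmed.length > 0 ? stemmed : exact;
}
```

A hit with no highlight object at all just yields an empty array, so the caller can fall back to showing the start of the document.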

Here's my index configuration:

{
  mappings: {
    documents: {
      properties: {
        title: { type: 'string' },
        text: {
          type: 'string',
          term_vector: 'with_positions_offsets_payloads',
          fields: {
            english: {
              type: 'string',
              analyzer: 'english_nostop',
              term_vector: 'with_positions_offsets_payloads',
              store: true
            }
          }
        }
      }
    }
  },
  settings: {
    analysis: {
      filter: {
        english_stemmer: {
          type: 'stemmer',
          language: 'english'
        },
        english_possessive_stemmer: {
          type: 'stemmer',
          language: 'possessive_english'
        }
      },
      analyzer: {
        english_nostop: {
          tokenizer: 'standard',
          filter: [
            'english_possessive_stemmer',
            'lowercase',
            'english_stemmer'
          ]
        }
      }
    }
  }
}

And here's what a query looks like:

{
  query: {
    query_string: {
      query: <query>,
      fields: ['text.english'],
      analyzer: 'english_nostop'
    }
  },
  highlight: {
    fields: {
      'text.english': {},
      'text': {}
    }
  }
}
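
One thing to watch with query_string: unquoted words are treated as separate terms, so the phrase has to be wrapped in double quotes to be searched as a phrase. Here's a minimal sketch of building the request body from user input. The function name buildPhraseSearch is hypothetical, and the escaping only covers backslashes and double quotes, which is an assumption about what the input may contain:

```javascript
// Hypothetical helper: builds the search body shown above for one phrase.
function buildPhraseSearch(phrase) {
  // Escape backslashes first, then double quotes, so the whole input
  // stays inside a single quoted phrase in query_string syntax.
  const escaped = phrase.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return {
    query: {
      query_string: {
        query: '"' + escaped + '"',
        fields: ['text.english'],
        analyzer: 'english_nostop'
      }
    },
    highlight: {
      fields: {
        'text.english': {},
        'text': {}
      }
    }
  };
}
```

Calling buildPhraseSearch('it is rare') gives a body whose query string is "it is rare" with the quotes included, which can then be sent to the index's _search endpoint.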