How do I combine stopword and whitespace-removal filters in Elasticsearch?


I need to remove certain keywords and then remove all spaces from the string during index analysis. I am using these filters:

'analysis' => array(
    'filter' => array(
        'whitespace_remove' => array(
            'type' => 'pattern_replace',
            'pattern' => ' ',
            'replacement' => ''
        ),
        'my_stop' => array(
            'type' => 'stop',
            'stopwords' => array('bad', 'horrible', 'useless')
        ),
        'edge' => array(
            'type' => 'edge_ngram',
            'min_gram' => '1',
            'max_gram' => '5'
        )
    ),

and this analyzer:

'keyword_space_ngram' => array(
    'type' => 'custom',
    'tokenizer' => 'keyword',
    'filter' => array(
        'lowercase',
        'my_stop',
        'whitespace_remove',
        'edge'
    )
)

How do I ensure that the filters are applied in this order: convert to lowercase, remove the keywords, remove the spaces, and then perform the ngram analysis?

There is 1 answer

Oscar On

You can remove the stop words and the whitespace with custom char filters at index time:

  {
    "analysis": {
      "char_filter": {
        "whitespace_remove": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        },
        "custom_stop_words_char_filter": {
          "type": "mapping",
          "mappings": [
            "bad =>  ",
            "horrible =>  ",
            "useless =>  "
          ]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "asciifolding"],
          "char_filter": ["custom_stop_words_char_filter", "whitespace_remove"]
        }
      }
    }
  }
  • For example, this transforms "bad angry man" into "angryman".

  • To add your edge_ngram filter, define edge under "filter" in the analysis settings (as in your question) and append "edge" to the end of the analyzer's filter array.

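  • Note: char filters run on the raw text, before the tokenizer and the lowercase token filter, so your stop words are only replaced when they appear in lowercase in the input.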
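
To make the effect of the stage order easier to see, here is a rough Python sketch that only imitates the two chains. It is not Elasticsearch code: all function names are made up for illustration, asciifolding is left out, and the mapping char filter is approximated as case-sensitive substring replacement with an empty replacement, which is my assumption about how the mappings above behave. In the chain from the question, the keyword tokenizer emits the whole input as a single token and the stop filter compares whole tokens, so a stop word inside a longer string is never removed.

```python
import re

STOP_WORDS = ["bad", "horrible", "useless"]


def mapping_char_filter(text):
    # Assumption: approximates the "mapping" char filter as plain,
    # case-sensitive substring replacement (no word boundaries).
    for word in STOP_WORDS:
        text = text.replace(word, "")
    return text


def whitespace_remove(text):
    # Same pattern as the whitespace_remove char filter: "\\s+" -> "".
    return re.sub(r"\s+", "", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    # A stop token filter drops tokens that equal a stop word as a whole.
    return [t for t in tokens if t not in STOP_WORDS]


def edge_ngram(tokens, min_gram=1, max_gram=5):
    # Prefixes of each token, from min_gram up to max_gram characters.
    return [t[:n] for t in tokens
            for n in range(min_gram, min(max_gram, len(t)) + 1)]


def analyze_answer(text):
    # Char filters first (raw text), then tokenizer, then token filters.
    text = whitespace_remove(mapping_char_filter(text))
    tokens = text.split()  # stand-in for the whitespace tokenizer
    return edge_ngram(lowercase(tokens))


def analyze_question(text):
    # keyword tokenizer: the whole input becomes one token.
    tokens = stop_filter(lowercase([text]))
    tokens = [t.replace(" ", "") for t in tokens]  # pattern_replace ' ' -> ''
    return edge_ngram(tokens)


print(analyze_answer("bad angry man"))    # ['a', 'an', 'ang', 'angr', 'angry']
print(analyze_answer("Bad angry man"))    # ['b', 'ba', 'bad', 'bada', 'badan']
print(analyze_question("bad angry man"))  # ['b', 'ba', 'bad', 'bada', 'badan']
```

In this sketch the capitalised "Bad" survives the char-filter chain, which matches the lowercase note above, and the chain from the question keeps "bad" because the single token "bad angry man" is not itself a stop word. Because the replacement here is substring-based, it would also turn "badge" into "ge"; it is worth checking with the _analyze API whether your real analyzer behaves the same way on your data.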