Support for both EdgeNGram analysis and phrase search in Solr 3.4.0


I want to enable "startsWith" search for each term in a Solr query while still being able to perform phrase searches (given in quotes). For the prefix search, I first appended the wildcard "*" to each term. This allows both prefix search and phrase search, but I don't like it because it is a wildcard search, and wildcard queries are not analyzed.

So I enabled the EdgeNGramFilterFactory at index time only. The prefix search works fine, but the exact phrase search no longer works.

Does anyone know how to enable phrase search while EdgeNGram is enabled?

Thanks!

Here is the schema.xml

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="back"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Also, I have noticed that highlighting no longer works well when the WordDelimiterFilterFactory is used.


There are 3 answers

0
Persimmonium On

Yet another option is to upgrade to 3.6.0, where wildcards no longer prevent the query terms from being analyzed (at least by multi-term-aware filters such as lowercasing).

0
Grimmo On

Phrase search does not work because EdgeNGram produces additional terms and, surprisingly, increments the term position for each chunk of the word. Phrases are expected to match exactly, meaning the positions of two consecutive terms must differ by exactly 1 (a slop of 0). With the chunks indexed, the text looks different. Imagine you have indexed the text "Hello World" using <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" side="front"/> (assuming lowercasing and a large enough maxGramSize). The indexed text would then look like "he hel hell hello wo wor worl world". You would find the phrase "hel hell" rather than "hello world".


As an option, you could allow some distance between words by increasing the qs (query phrase slop) parameter of the dismax query parser.
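As a rough sketch of that option, the slop could be set as a default on a dismax request handler in solrconfig.xml. The handler name `/phrase-search`, the field name `text`, and the slop value 5 are assumptions made for illustration, not values from the question.

```xml
<!-- solrconfig.xml fragment: a dismax handler with a default phrase slop. -->
<!-- Handler name, field name and slop value are hypothetical. -->
<requestHandler name="/phrase-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Field(s) to search; replace with the n-grammed field. -->
    <str name="qf">text</str>
    <!-- Slop applied to phrases the user types explicitly in quotes. -->
    <str name="qs">5</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request (for example `&qs=5`), which makes it easier to experiment with how much slop is needed before unwanted matches appear.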

But a non-exact phrase search may be unacceptable, because it would also match unexpected phrases such as 'hel hell'.

A better option is to use a separate field for the n-grams. The text is then indexed in two fields, and the n-grams do not disturb the term positions of the original text.
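A minimal schema.xml sketch of the two-field approach might look like the following. The type names `text_phrase` and `text_prefix` and the field names `title` and `title_prefix` are hypothetical, and the analyzer chains are deliberately reduced to the essentials rather than copied from the question.

```xml
<!-- Goes in the types section: plain field type, no n-grams, -->
<!-- so term positions stay intact and phrase queries work. -->
<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes in the types section: prefix field type, -->
<!-- with edge n-grams generated at index time only. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes in the fields section: the same text indexed twice via copyField. -->
<field name="title" type="text_phrase" indexed="true" stored="true"/>
<field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
<copyField source="title" dest="title_prefix"/>
```

With a layout like this, quoted phrases would be sent to `title` and bare terms to `title_prefix`, so neither kind of query has to compromise for the other.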

0
Max Schmidt On

You can use two fields: one for prefix and suffix search and another for exact matching.

  <field indexed="true" name="myfield_edgy"        type="edgy"/>
  <field indexed="true" name="myfield_exactmatch"  type="exactmatch"/>
  <copyField source="myfield_exactmatch" dest="myfield_edgy"/>

Now you can search in both fields and even use different boosts, e.g. to rank matches in myfield_exactmatch higher.
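One way to express such boosts is the qf parameter of the dismax query parser. The fragment below is only a sketch: the handler name `/prefix-search` and the boost value 10 are assumptions, and the `edgy` and `exactmatch` field types still have to be defined in schema.xml.

```xml
<!-- solrconfig.xml fragment: search both fields, favoring exact matches. -->
<!-- Handler name and boost value are hypothetical. -->
<requestHandler name="/prefix-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- A match in myfield_exactmatch counts ten times as much as one in myfield_edgy. -->
    <str name="qf">myfield_exactmatch^10 myfield_edgy</str>
  </lst>
</requestHandler>
```

The boost is a tuning knob rather than a fixed rule, so it is worth checking the ranking against a handful of real queries before settling on a value.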