Apache solr search issue

102 views Asked by At

i've got a search issue with apachesolr.

For example The contents that i've indexed are:

  • Tiramisu d'hiver
  • Velouté d'hiver
  • Minestrone d'hiver crémeux,
  • Smoothie version hiver

when i search "hiver", i get only Smoothie version hiver as results.

When i search dhiver, i get as results

  • Tiramisu d'hiver
  • Velouté d'hiver
  • Minestrone d'hiver crémeux

I need to get all results whether i search hiver or dhiver or dhiver

Any one have an idea what is the problem? Do i have to change something in my schema.xml ?

My schema for textfield is :

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory" 
          generateWordParts="1" 
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          preserveOriginal="1"
    />
    <filter class="solr.LengthFilterFactory" min="3" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory" 
          generateWordParts="1" 
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="0"
          catenateAll="0"
          splitOnCaseChange="1"
          splitOnNumerics="1"
    />
    <filter class="solr.LengthFilterFactory" min="3" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

  </analyzer>

  <analyzer type="multiterm">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
1

There are 1 answers

0
David George On BEST ANSWER

Hmmm tasty.

First point, for all these kind of problems use the Solr Analysis tool is your friend. Second, remember that Solr only matches if the query and terms are 100% character for character identical.

For the following filter

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />

Velouté d'hiver will be analyzed as

veloute | d'hiver | d | dhiver | hiver

So will match your query for hiver - you may want to remove the | d | token that my filter generated.

Remember to fold accent characters too somewhere.