Solr SnowballPorterFilterFactory for index and query analyzers


I use SnowballPorterFilterFactory for both the index and query analyzers. When I search for the word "profession", Solr finds only articles that contain "profession", but I also want it to match "professional", "professionalism", and so on.

This is the current configuration in schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="French"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="French"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
</fieldType>

There is 1 answer

Answer by harmstyler

What is happening is that the Porter stemmer is over-stemming your query. When you search for profession, your keyword gets stemmed down to profess, whereas profession, professional, and professionalism are all stored in the index as profession.

The only real way to get around this is to add another fieldType in which the query is not stemmed.

Something like:

<fieldType name="text_unstemmed_query" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="French"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
</fieldType>

With a copyField like:

<copyField source="your_text_field" dest="text_unstem_query_field"/>
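
For the copyField to work, the destination field also has to be declared in schema.xml with the new type. A minimal sketch, assuming the destination name text_unstem_query_field from the copyField line; the indexed, stored, and multiValued values are assumptions to adjust for your schema:

```xml
<!-- Assumed declaration: destination of the copyField, analyzed with the
     text_unstemmed_query fieldType. Not stored, since it only needs to be searchable. -->
<field name="text_unstem_query_field" type="text_unstemmed_query" indexed="true" stored="false" multiValued="true"/>
```

Queries then need to target this field, for example q=text_unstem_query_field:profession, and existing documents have to be reindexed before the copied field is populated.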