Solr (Open Solr) suggester results contain punctuation marks

Question

Asked by Mark Robson on 27 November 2014 at 11:39

I'm working on a suggester, and the results I'm getting back contain punctuation. For example, when I type "Volcan" I get:

"volcanoes",
"volcanic",
"volcano",
"volcano,",   <- comma
"volcanoes."  <- period/full stop

Here is the code in the solrconfig.xml file:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <lst name="invariants">
      <!-- always run the Suggester for queries to this handler -->
      <str name="spellcheck">true</str>
      <!-- collate not needed, query if tokenized as keyword, we need only suggestions for that term -->
      <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

In the schema.xml file I have this:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
                    minShingleSize="2"
                    maxShingleSize="4"
                    outputUnigrams="true"
                    outputUnigramsIfNoShingles="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And the result is:

{
    "responseHeader": {
        "status": 0,
        "QTime": 0,
        "params": {
            "wt": "json",
            "q": "volcan"
        }
    },
    "spellcheck": {
        "suggestions": [
            "volcan",
            {
                "numFound": 5,
                "startOffset": 0,
                "endOffset": 6,
                "suggestion": [
                    "volcanoes",
                    "volcanic",
                    "volcano",
                    "volcano,",
                    "volcanoes."
                ]
            }
        ]
    }
}

There is 1 answer

**Opensolr.com** · Answer 1 · 2015-05-06T09:37:37+00:00

The problem is not really in your requestHandler. Rather, it seems to lie in the way you're indexing the content that goes into the spell field, and maybe in the spell field itself. I think you should use a tokenizer that strips the punctuation out of those fields.
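
For context, solr.StandardTokenizerFactory already discards punctuation such as a trailing comma or period, so tokens like "volcano," usually point to a field analyzed with a tokenizer that keeps punctuation attached to words. If that tokenizer cannot be swapped, one alternative is to strip the punctuation with a filter. The fragment below is only a sketch under that assumption: the choice of WhitespaceTokenizerFactory, the regular expression, and the length limits are illustrative, not taken from the original configuration.

```xml
<!-- Sketch only: for an analyzer whose tokenizer keeps punctuation attached
     to words (assumed here to be solr.WhitespaceTokenizerFactory). -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Strip leading and trailing punctuation from each token -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+|\p{Punct}+$" replacement="" replace="all"/>
  <!-- Drop tokens that became empty after stripping (limits are illustrative) -->
  <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

After changing an index-time analyzer, the documents need to be reindexed and the suggester rebuilt before the suggestions change.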

Here's the spell field definition that works for me in schema.xml:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
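
Note that the suggester in the question reads its terms from the field named text (`<str name="field">text</str>`), so a spell field type only has an effect if the field the suggester reads actually uses it. A minimal sketch of one way to wire this up follows. The field name spell and the source fields title and body are hypothetical placeholders, not names from the original schema.

```xml
<!-- schema.xml: hypothetical field that uses the "spell" type defined above -->
<field name="spell" type="spell" indexed="true" stored="false" multiValued="true"/>

<!-- "title" and "body" are placeholder source fields; substitute your own.
     copyField does not chain, so copy from the original source fields rather
     than from a field that is itself a copyField destination. -->
<copyField source="title" dest="spell"/>
<copyField source="body" dest="spell"/>

<!-- solrconfig.xml, inside the "suggest" spellchecker definition:
     point the suggester at the new field instead of "text" -->
<str name="field">spell</str>
```

As with any schema change, this assumes a full reindex. With buildOnCommit set to true, as in the question, the suggester dictionary should then be rebuilt on the next commit.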
