Solr Dismax handler - whitespace and special character behaviour

Question

Solr Dismax handler - whitespace and special character behaviour

5.7k views Asked by Romain Meresse At 25 October 2011 at 10:21

I've got strange results when I have special characters in my query.

Here is my request :

q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

Parsed query :

<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>

I've got 17000 results because Solr is doing an OR (should be AND).

I have no problem when I'm using a whitespace instead of a special char :

q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>

2000 results for this query.

Here is my schema.xml (relevant parts) :

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

I even tried with a PatternTokenizerFactory to tokenize on whitespaces & special chars but no change...

My current workaround is to replace all special chars by whitespaces before sending query to Solr, but it is not satisfying.

EDIT : Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters by whitespace, it doesn't work...

First line of analysis via solr admin, with verbose output, for query = 'histoire-france' :

org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text    histoire france

The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory. However I still have different number of results for 'histoire-france' and 'histoire france'.

Did i miss something ?

Original Q&A

There are 4 answers

Jayendra On 25 October 2011 at 18:22

Enable the autoGeneratePhraseQueries to true and this would generate the phrase queries.
So when searched for histoire-franc, it would generate a query with quotes which will enable only the documents having both words as a phrase being matched.

<str name="parsedquery">(+DisjunctionMaxQuery(((any:histoire any:franc))))/no_coord</str>

Example working configuration -

<fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Use query slop to specify the number of slops e.g. qs=10 in a phrase query.

<str name="parsedquery">(+DisjunctionMaxQuery((any:"histoire france"~10)))/no_coord</str>

The Bndr On 25 October 2011 at 15:02

using WhitespaceTokenizerFactory, Solr will split your query string into words.

But, after tokenizing you(Solr) split your word (again) into terms using solr.WordDelimiterFilterFactory. Look at the documentation and look at the Wi-Fi example.

That could be one reason, why histoire france and histoire-france are handled different.

2nd: don't forget, that the DSIMAX handles (normally) the query-term as "term" and also (additional) as parsed string again.

To solve your problem, you could try to avoid the world delimiter and try to handle "tokenizing" by using PatternTokenizerFactory (as you tried before, but now without WordDelimiterFilterFactory).

If that doesn't work, try to post the complete output of the analysys.jsp

Grimmo On 06 February 2012 at 18:28

You get different number of results searching for 'histoire-france' and 'histoire france' because query parser creates a phrase query in the first case, and a boolean query (separate two words) in the second case.

This is not obvious behavior imho, but i believe it's hard to satisfy all use cases.

To make search treating 'histoire-france' as simply two words you can add "solr.PositionFilterFactory" to the end of query analyzer like:

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>

Then search results for 'histoire-france' and 'histoire france' will be equal.

Note that position filter can be undesired for phrase searches (both 'historie' and 'france' to be present). Consider using of query slops parameter qs > 0 instead in case you have modified term sequence with say NGram filter.

**Romain Meresse** · Accepted Answer · 2013-01-24T09:38:42+00:00

It was a bug : https://issues.apache.org/jira/browse/SOLR-3589

With edismax mm set to 100% if one of the tokens is split into two tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of OR query for "fire OR fly" is produced. This is particularly a problem for languages that do not use white space to separate words such as Chinese or Japenese.

It is fixed in Solr 4.1 (22 January 2013)

TechQA.

Solr Dismax handler - whitespace and special character behaviour

There are 4 answers

Related Questions in SOLR

Related Questions in LUCENE

Related Questions in TOKENIZE

Related Questions in DISMAX

Popular Questions

Popular Tags

Trending Questions