Solr Dismax handler - whitespace and special character behaviour

I get strange results when my query contains special characters.

Here is my request :


Parsed query :

<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>

I get 17000 results because Solr performs an OR where it should perform an AND.

I have no problem when I use whitespace instead of a special character:

q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>

2000 results for this query.
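For anyone reproducing the comparison, here is a minimal Python sketch that builds both query strings with the parameters from the request above. The hyphenated form is an assumption based on the 'histoire-france' query mentioned later in the question, and `build_query_string` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode, parse_qs

# Parameters copied from the whitespace request above; only `q` varies.
BASE_PARAMS = {
    "start": "0",
    "rows": "10",
    "sort": "score desc",
    "defType": "dismax",
    "qf": "any^1.0",
    "mm": "100%",
}

def build_query_string(q):
    """Return a URL-encoded query string for the given user query (hypothetical helper)."""
    params = {"q": q}
    params.update(BASE_PARAMS)
    return urlencode(params)

with_space = build_query_string("histoire france")
with_hyphen = build_query_string("histoire-france")  # assumed form of the failing request
print(with_space)
print(with_hyphen)
```

Note that `urlencode` escapes the `%` in `mm=100%` as `%25`, which avoids the parameter being misread when the URL is assembled by hand.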

Here is my schema.xml (relevant parts) :

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
</fieldType>

I even tried a PatternTokenizerFactory to tokenize on whitespace and special characters, but nothing changed.

My current workaround is to replace all special characters with whitespace before sending the query to Solr, but this is not a satisfying solution.
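The client-side workaround could look roughly like the Python sketch below. It assumes the set of special characters is the one from the PatternReplaceCharFilterFactory pattern quoted in the EDIT further down (`, ; . / \ ' & -`); `normalize_query` is a hypothetical helper name.

```python
import re

# Same character class as the charFilter pattern quoted in the question:
# , ; . / \ ' & -
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]")

def normalize_query(q):
    """Replace special characters with spaces and collapse repeated whitespace."""
    replaced = SPECIAL_CHARS.sub(" ", q)
    return " ".join(replaced.split())

print(normalize_query("histoire-france"))  # histoire france
```

With this in place, 'histoire-france' is sent as 'histoire france', so the parser sees two whitespace-separated words.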

EDIT: Even with a charFilter (PatternReplaceCharFilterFactory) that replaces special characters with whitespace, it doesn't work.

First line of the analysis from the Solr admin page, with verbose output, for the query 'histoire-france':

org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text    histoire france

The '-' is replaced by ' ' and the text is then tokenized by WhitespaceTokenizerFactory. However, I still get a different number of results for 'histoire-france' and 'histoire france'.

Did I miss something?


Romain Meresse On BEST ANSWER

It was a bug :

With edismax mm set to 100% if one of the tokens is split into two tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of OR query for "fire OR fly" is produced. This is particularly a problem for languages that do not use white space to separate words such as Chinese or Japanese.

It is fixed in Solr 4.1 (22 January 2013)
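To illustrate the behaviour described in the bug report, here is a toy Python model. It is not Solr's code; it only assumes, consistent with the two parsed queries in the question, that the query is first split on whitespace into clauses, that each clause is then analyzed into terms, and that mm is applied to the number of clauses rather than the number of terms. All function names and the sample documents are made up for the illustration.

```python
import re

def analyze(chunk):
    """Toy analyzer: lowercase and split on non-word characters (e.g. '-')."""
    return [part for part in re.split(r"\W+", chunk.lower()) if part]

def parse(query):
    """Split on whitespace first; each chunk becomes one clause of analyzed terms."""
    return [analyze(chunk) for chunk in query.split()]

def matches(doc_terms, clauses, mm_percent=100):
    """A clause matches if ANY of its terms is present; mm counts clauses, not terms."""
    required = len(clauses) * mm_percent // 100
    matched = sum(1 for clause in clauses if any(t in doc_terms for t in clause))
    return matched >= required

def count_hits(docs, query):
    """Number of documents matching the query under the toy model."""
    return sum(1 for doc in docs if matches(doc, parse(query)))

# Made-up documents, each represented as a set of indexed terms.
docs = [
    {"histoire", "france"},
    {"histoire", "art"},
    {"france", "vin"},
]

print(count_hits(docs, "histoire france"))  # 1
print(count_hits(docs, "histoire-france"))  # 3
```

In the hyphenated case there is a single clause, so 100% of one clause is satisfied as soon as either term matches, which is the OR behaviour observed in the question.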

Jayendra On

Set autoGeneratePhraseQueries to true so that phrase queries are generated.
When you search for histoire-france, the parser then builds a quoted phrase query, so only documents that contain both words as a phrase match.

<str name="parsedquery">(+DisjunctionMaxQuery(((any:histoire any:franc))))/no_coord</str>

Example working configuration -

<fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Use the query slop parameter to set the slop of the phrase query, e.g. qs=10.

<str name="parsedquery">(+DisjunctionMaxQuery((any:"histoire france"~10)))/no_coord</str>
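As a rough illustration of what the `~10` slop means, the Python sketch below checks a two-term phrase where the second term may follow the first with up to `slop` other tokens in between. This is a simplification under stated assumptions: Lucene's slop also allows reordered terms, which this sketch ignores, and position increments from stop word removal are not modelled. `sloppy_phrase_match` is a hypothetical name.

```python
def sloppy_phrase_match(doc_tokens, first, second, slop):
    """True if `second` occurs after `first` with at most `slop` tokens in between."""
    first_positions = [i for i, t in enumerate(doc_tokens) if t == first]
    second_positions = [i for i, t in enumerate(doc_tokens) if t == second]
    return any(
        0 <= j - i - 1 <= slop
        for i in first_positions
        for j in second_positions
    )

doc = "histoire de la france".split()
print(sloppy_phrase_match(doc, "histoire", "france", slop=0))   # False
print(sloppy_phrase_match(doc, "histoire", "france", slop=10))  # True
```

With slop 0 only the exact adjacent phrase matches; a larger slop lets the phrase query tolerate intervening words while still requiring both terms.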

The Bndr On

Using WhitespaceTokenizerFactory, Solr splits your query string into words on whitespace.

After tokenizing, however, solr.WordDelimiterFilterFactory splits each word again into terms. Look at its documentation, in particular the Wi-Fi example.

That could be one reason why 'histoire france' and 'histoire-france' are handled differently.

Second, don't forget that DisMax normally handles the query term as a "term" and additionally as a parsed string.

To solve your problem, you could drop the word delimiter filter and handle the tokenizing with PatternTokenizerFactory (as you tried before, but this time without WordDelimiterFilterFactory).

If that doesn't work, please post the complete output of analysis.jsp.

Grimmo On

You get a different number of results for 'histoire-france' and 'histoire france' because the query parser creates a phrase query in the first case and a boolean query (two separate words) in the second case.

This is not obvious behavior IMHO, but I believe it's hard to satisfy all use cases.

To make the search treat 'histoire-france' as simply two words, you can add solr.PositionFilterFactory to the end of the query analyzer, like this:

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>

The search results for 'histoire-france' and 'histoire france' will then be the same.

Note that the position filter can be undesirable for phrase searches (where both 'histoire' and 'france' must be present). If you have modified the term sequence with, say, an NGram filter, consider using the query slop parameter (qs > 0) instead.