eDisMax queries with stopwords and language-specific fields


I have 3 text fields:

  • content_en
  • content_sp
  • content_fr

Each of these fields has its own analyzers, tokenizers, and filters, as well as its own stopword list.

I use the LangIdentifierProcessor (https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing) to detect the language of each document at index time, and Solr writes the document's content to the matching field.

Finally, I use the eDisMax parser to handle queries. My qf parameter lists the 3 fields above, and the mm parameter is set to 100%.

Here is my issue: when I search for 'Yellow House', Solr returns all documents containing the terms Yellow and House. Great. But when I search for 'The Yellow House', I get nothing back. After debugging for some time, I found that Solr constructs a query similar to the following for 'The Yellow House':

  +((content_sp:the | content_fr:the)(content_en:yellow | content_sp:yellow | content_fr:yellow)(content_en:house | content_sp:house | content_fr:house))

Remember that mm is set to 100%, so a document must match every clause to be returned. Since 'the' is a stopword for my English field, Solr drops it from the query against content_en. It still includes it in the query against the other two fields, though, and that clause can never match an English document: because of the LangIdProcessor step described in the link above, content_sp and content_fr are empty for English documents.
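To make the failure mode concrete, here is a minimal sketch in Python. It is a toy model and not Solr code: the stopword sets, the helper names (build_clauses, matches, index_english), and the set-of-strings "index" are all assumptions made for illustration. It only mimics the shape of the parsed query shown above, where each query term becomes one clause and each clause lists the fields whose analyzer kept the term.

```python
# Toy model of the behaviour described above; NOT actual Solr code.
# The stopword sets below are hypothetical stand-ins for the real per-field lists.
STOPWORDS = {
    "content_en": {"the", "a", "of"},
    "content_sp": {"el", "la", "de"},
    "content_fr": {"le", "la", "de"},
}


def build_clauses(query, stopwords=STOPWORDS):
    """Return one list of 'field:term' alternatives per query term.

    A field is left out of a clause when the term is a stopword for that field.
    """
    clauses = []
    for term in query.lower().split():
        alternatives = [
            f"{field}:{term}"
            for field, stops in stopwords.items()
            if term not in stops
        ]
        if alternatives:  # a term removed by every field produces no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Model mm=100%: every clause needs at least one alternative in the doc."""
    return all(any(alt in doc for alt in clause) for clause in clauses)


def index_english(text, stopwords=STOPWORDS):
    """Index text into content_en only, as language detection would for English."""
    return {
        f"content_en:{term}"
        for term in text.lower().split()
        if term not in stopwords["content_en"]
    }


doc = index_english("The yellow house on the hill")
print(build_clauses("The Yellow House")[0])  # ['content_sp:the', 'content_fr:the']
print(matches(doc, build_clauses("Yellow House")))      # True
print(matches(doc, build_clauses("The Yellow House")))  # False
```

The clause for 'the' survives only for content_sp and content_fr, and an English document has nothing in those fields, so the all-clauses-required check fails.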

Now, as a quick fix I suppose I could merge all of my stopwords into a single file, but that feels wrong. I also know I can specify the qf fields with each query, which would let me detect the query language and then choose the fields to search over. But can I do something in Solr itself to handle this (maybe some sort of SearchComponent)? Or is my multilingual approach incorrect?
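The per-query qf option mentioned above could look roughly like the following client-side sketch. It assumes the query language has already been detected by some means outside this snippet; the FIELD_BY_LANG mapping and the function names are hypothetical. The request parameters themselves (defType, q, qf, mm) are the standard eDisMax ones.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a detected language code to the field for that language.
FIELD_BY_LANG = {
    "en": "content_en",
    "sp": "content_sp",
    "fr": "content_fr",
}


def edismax_params(query, lang=None):
    """Build eDisMax request parameters, restricting qf when the language is known.

    If lang is None or unrecognised, fall back to searching all three fields.
    """
    if lang in FIELD_BY_LANG:
        fields = [FIELD_BY_LANG[lang]]
    else:
        fields = list(FIELD_BY_LANG.values())
    return {
        "defType": "edismax",
        "q": query,
        "qf": " ".join(fields),
        "mm": "100%",
    }


def query_string(query, lang=None):
    """URL-encode the parameters for use in a /select request."""
    return urlencode(edismax_params(query, lang))


print(query_string("The Yellow House", "en"))
```

With qf limited to content_en, 'the' is removed by the only field being searched, so no unmatched clause is left behind. The trade-off is that a wrong language guess hides documents in the other fields.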


There is 1 answer

Tim Cardwell On

This is my problem: https://issues.apache.org/jira/browse/SOLR-3085

There doesn't seem to be a clear fix for this, so I am going to merge all of my stopwords into a single list. This might cause minor issues, but it is a large improvement over an empty result set.
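As a rough sketch of that merge step, the helper below takes the union of several stopword lists. It assumes one word per line, blank lines and lines starting with '#' to be skipped, and lowercased terms; merge_stopwords is a hypothetical helper name, and the sample words are made up.

```python
def merge_stopwords(*sources):
    """Return the sorted union of stopwords from several iterables of lines.

    Assumes one word per line; blank lines and '#' comment lines are skipped.
    """
    merged = set()
    for lines in sources:
        for line in lines:
            word = line.strip().lower()
            if word and not word.startswith("#"):
                merged.add(word)
    return sorted(merged)


english = ["# English", "the", "a"]
spanish = ["el", "la", ""]
french = ["le", "la"]
print(merge_stopwords(english, spanish, french))  # ['a', 'el', 'la', 'le', 'the']
```

Each source can be an open file object just as well as a list, since both yield lines. The minor issues mentioned above would come from words that are stopwords in one language but meaningful in another, which the merged list removes everywhere.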

The mm.autoRelax approach looks promising, but it is not implemented in Solr 4.10, the version I am on (I know I'm behind).