"protected phrase" in Solr


A customer of mine is a photo agency specialized in photojournalism (well, and gossip), so many of their customers' searches revolve around specific people.

We index about 1.5 million documents, with full-text search on headline and caption, and full-text search without stemming on tags. We have a decent list of stop words, and the customer provides a list of protected words that they feel are not stemmed correctly. We are using Dismax to search over headline, caption and tags, with different boosts. This is all working pretty nicely.

However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query for `al gore` (without quotes) becomes:

+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()

That does return hits for the ex VP, but of course also for "Lesley Gore" and "Tipper Gore"; and also, because of stemming, hits for "Gori" and more. Leaving aside sorting for a second, it does clutter up results, and I'd like to do better.

Wrapping the search terms in quotes doesn't help: "al" gets stripped away anyway. Marking "gore" as a protected word gets me halfway there, limiting the number of false positives. I tried playing with SynonymFilterFactory too, but didn't get too far: I have SynonymFilterFactory as the first filter in the query analyzer, and "al" still gets removed.

What I think I really need is a way of tokenizing "al gore" as a single token. Is there anything that will allow me to do that, for a set of configurable "phrases"? Is there another approach I'm overlooking? solr.CommonGramsFilterFactory perhaps?

Some more background info: we are using Solr 1.4.0. Relevant portions of schema.xml:

<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>   
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>  
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

There is 1 answer

bdargan

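Have you looked into the CommonGramsFilterFactory? It will:

  • combine a common word (such as a stop word) and the token next to it into a single token, e.g. "al_gore", while keeping the original tokens
  • typically be used to make phrase searches work when the phrase contains stop words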
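
For what it's worth, here is a rough sketch of how this might be wired into the question's "text" field type. It is untested against that index and rests on a few assumptions: the type name "text_cg" is made up for illustration, stopwords.it.txt is reused as the list of common words (a shorter, dedicated list would also work), and WordDelimiterFilterFactory is left out because, as far as I know, it treats the underscore as a delimiter and would split the bigram apart again.

```xml
<!-- Sketch only: "text_cg" is a hypothetical name for a variant of the "text" type. -->
<fieldType name="text_cg" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Runs before the stop filter, while "al" is still in the token stream.
         Keeps the single words and adds bigrams such as "al_gore". -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <!-- Removes the lone "al"; the bigram is not a stop word, so it survives. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <!-- WordDelimiterFilterFactory omitted on purpose (assumption: it would split at "_"). -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Query-side counterpart: for a phrase containing a common word it emits
         the bigram in place of the single words where it can. -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats. The bigram can only form when both words reach the analyzer together, so this should help for a quoted query ("al gore") or for phrase matching through Dismax's pf parameter; an unquoted `al gore` is split on whitespace by the query parser before analysis, and "al" is then dropped as before. Also, the documents have to be reindexed after the change so that the bigrams are actually in the index.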