"Sunspot" Gem makes distinction between UTF-8 chars


In a Rails app I started using sunspot => https://github.com/sunspot/sunspot/blob/master/README.md

Everything went OK until I noticed this (taken from the rails-console):

1.9.3p194 :002 > MyModel.search{fulltext "leon"}.results
=> [#<MyModel id: 16, name: "Leon">]
1.9.3p194 :003 > MyModel.search{fulltext "león"}.results
=> [#<MyModel id: 18, name: "León">]

How can I tell the system not to distinguish between "leon" and "león"? I want something like search{fulltext "leon"} => [#<MyModel id: 16 ...>, #<MyModel id: 18 ...>].

I've searched for this problem and found the same answer every time:

Adding this line to the Gemfile works as a stopgap until the next release of rsolr: gem 'rsolr', :git => "https://github.com/mwmitchell/rsolr.git"

thx


There are 3 answers

Manu Artero On BEST ANSWER

Thx for the responses. I finally solved it last night with another idea I took from http://codeshooter.wordpress.com/2011/01/13/full-text-search-in-in-rails-with-sunspot-and-solr/

the idea is in Restaurant.rb

text :name do 
  self.name.my_normalize
end

and the body of the my_normalize function is

to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/,'').downcase

That line turns strings like "äáàÁÄÀ" into "aaaaaa".
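
For anyone who wants to try the normalization outside Rails, here is a minimal sketch of the same idea. It assumes plain Ruby 2.2 or later and swaps in String#unicode_normalize(:nfkd) for mb_chars.normalize(:kd), so it is not the exact code from the answer. Defining my_normalize on String is also an assumption, since the answer does not show where the method lives.

```ruby
# Hypothetical helper: where my_normalize is defined is not shown in the
# answer, so putting it on String is an assumption made for this sketch.
class String
  def my_normalize
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicode_normalize(:nfkd)
    # Drop every non-ASCII character, which removes the combining marks
    ascii_only = decomposed.gsub(/[^\x00-\x7F]/, '')
    ascii_only.downcase
  end
end

puts "León".my_normalize
puts "äáàÁÄÀ".my_normalize
```

Two things to keep in mind. Letters that have no decomposition (for example "ø" or "ß") are removed completely rather than replaced. And because only the indexed text is normalized, the search term probably needs the same treatment, e.g. fulltext "león".my_normalize, unless Solr folds accents on its side.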

Blacksad On

You need to make changes inside the Solr (the application, not the gem) configuration files. Solr is "embedded" in the gem, but you can access its configuration as if it were installed separately. Have a look at the Solr documentation.

vvlad On

In schema.xml you need to add a character filter, as described in AnalyzersTokenizersTokenFilters. For example:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

In mapping-ISOLatin1Accent.txt you should have entries that map each accented Unicode character to an ASCII character sequence. You can see an example here: mapping-ISOLatin1Accent.txt
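
To show where the filter goes, here is a rough sketch of a text field type with the character filter added. The field type name "text" and the surrounding tokenizer and filters are assumptions based on the kind of schema.xml Sunspot generates, so compare with your own file before copying anything. The one firm rule is that a charFilter has to come before the tokenizer inside the analyzer.

```xml
<!-- Sketch only: field type name and existing filters are assumed,
     check them against your own schema.xml -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- charFilter runs on the raw text, before tokenizing -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Entries in the mapping file look like "\u00E9" => "e", one per line. After changing the schema you will most likely have to restart Solr and reindex, since documents already in the index were analyzed without the filter.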