NHibernate.Search / Lucene.Net multilingual analyser


I am trying to integrate NHibernate.Search into a multilingual website. The website has a class, Article, whose content is multilingual. This is done with a separate class, Article_CultureInfo, which stores the language-specific content. The fields of Article are:

Article
-------
ID
Name

And the fields of Article_CultureInfo are:

Article_CultureInfo
-------
ID
ArticleId
CultureCode
PageTitle
Content

I am using NHibernate.Search.Mapping to map the field/document information. I would like to add search features such as stemming and synonym analysis where possible, based on the language. Is there any way the Lucene analyser can be specified at run time, rather than at compile time or initialisation?

Say we are analysing the content of PageTitle, which is to be stored in the respective Lucene index. This content can be English, French, Italian, etc., depending on the value of CultureCode, so the analyser should change based on that value. I have tried implementing a custom MultilingualAnalyser, but the only data available to it is the string to be analysed, i.e. the value of PageTitle. From that alone I cannot deduce the language. (I could look into language detection techniques, but that is out of scope: I already know exactly what the language is, so detection would be overkill and not 100% reliable.)

If, apart from the tokens, I had access to the instance of the object being indexed, I could read the CultureCode value from it and analyse accordingly. Any ideas would be greatly appreciated. I would really like to avoid using Lucene.Net directly, since NHibernate.Search looks to integrate very nicely.

Thanks!

Answer by Karl Cassar (best answer)

I ended up implementing a work-around for this. It is quite an overkill, but it works.

I've created a new implementation of IGetter for multilingual properties, which I called MultilingualGetter. It is basically the same as BasicGetter. I couldn't extend BasicGetter because for some reason it is sealed, so I copied its code.

This is what the IGetter does: when its Get() method is called, it is given the target object, i.e. the instance of the class that contains the property. I check that the target implements an interface I created for multilingual objects, IMultilingualContentInfo. The getter then retrieves the current culture from the IMultilingualContentInfo and prepends it to the actual text, e.g. [en]Hello World!.

This text is then passed on to a custom analyzer I created, which parses the culture prefix and so knows which language the text is in. It then uses a SnowballFilter to stem the text based on that language.
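
The analyzer itself is not shown in this answer. As a rough illustration of the parsing step only, here is a minimal sketch that splits the `[en]` prefix off the text and maps the culture code to a stemmer language name. The class `CulturePrefix`, its method names and the code-to-name mapping are all hypothetical, and the names "English", "French" and "Italian" are assumed to be the ones the SnowballFilter accepts. Wiring the result into an actual Lucene.Net analyzer is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original answer): parses the
// "[culture]text" format produced by the getter.
public static class CulturePrefix
{
    // Assumed mapping from two-letter culture codes to stemmer language names.
    private static readonly Dictionary<string, string> StemmerNames =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "en", "English" },
            { "fr", "French" },
            { "it", "Italian" },
        };

    // Splits "[en]Hello World!" into culture "en" and text "Hello World!".
    // Returns false, leaving text equal to input, when there is no usable prefix.
    public static bool TryParse(string input, out string culture, out string text)
    {
        culture = null;
        text = input;
        if (string.IsNullOrEmpty(input) || input[0] != '[')
        {
            return false;
        }

        int close = input.IndexOf(']');
        if (close <= 1)
        {
            return false; // no closing bracket, or an empty "[]" prefix
        }

        culture = input.Substring(1, close - 1);
        text = input.Substring(close + 1);
        return true;
    }

    // Returns the stemmer language name for a culture code such as "en" or
    // "fr-FR", or null when the language is not in the mapping.
    public static string GetStemmerName(string culture)
    {
        if (string.IsNullOrEmpty(culture))
        {
            return null;
        }

        int dash = culture.IndexOf('-');
        string language = dash > 0 ? culture.Substring(0, dash) : culture;

        string name;
        return StemmerNames.TryGetValue(language, out name) ? name : null;
    }
}
```

A custom analyzer could call `TryParse` on the incoming text, look up the stemmer name, and fall back to an unstemmed token stream when either step fails.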

Below is the code for the Get() method of the custom IGetter implementation, MultilingualGetter:

    /// <summary>
    /// Gets the value of the Property from the object, prefixed with the
    /// culture code of the owning object (e.g. "[en]Hello World!").
    /// </summary>
    /// <param name="target">The object to get the Property value from.</param>
    /// <returns>
    /// The culture-prefixed value of the Property for the target.
    /// </returns>
    public object Get(object target)
    {
        // The culture can only be read from objects that expose it.
        IMultilingualContentInfo multiLingualTarget = target as IMultilingualContentInfo;
        if (multiLingualTarget == null)
        {
            throw new InvalidOperationException("Multilingual Getter is only available on IMultilingualContentInfo objects");
        }

        try
        {
            string s = (string)property.GetValue(target, new object[0]);
            if (!string.IsNullOrWhiteSpace(s))
            {
                MultilingualLuceneTextContent mlText = new MultilingualLuceneTextContent();
                mlText.Culture = multiLingualTarget.CultureInfo.GetCultureCode();
                // NOTE: the snippet as posted never hands the text to mlText.
                // The property name Text is an assumption.
                mlText.Text = s;
                s = mlText.GetTextIncCulture();
            }
            // Null, empty or whitespace-only values are returned without a prefix.
            return s;
        }
        catch (Exception e)
        {
            throw new PropertyAccessException(e, "Exception occurred", false, clazz, propertyName);
        }
    }
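
The MultilingualLuceneTextContent class used above is not shown in the answer. A minimal sketch of what such a helper might look like follows, assuming it simply holds the culture code and the text and joins them in the `[en]Hello World!` format described earlier. The `Text` property is a guess, and the real class may well differ.

```csharp
using System;

// Sketch only: the real MultilingualLuceneTextContent is not shown in the
// answer, so the Text property and the exact output format are assumptions.
public class MultilingualLuceneTextContent
{
    // Culture code of the content, e.g. "en".
    public string Culture { get; set; }

    // The raw text to be indexed (assumed property).
    public string Text { get; set; }

    // Returns the text with the culture code prepended in square brackets.
    public string GetTextIncCulture()
    {
        return "[" + Culture + "]" + Text;
    }
}
```

With `Culture` set to "en" and `Text` set to "Hello World!", `GetTextIncCulture()` produces the string that the custom analyzer later has to parse.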