Extend Endeca's diacritic folding mapping


We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data contains words with accents. If the search terms are entered without accents, they don't match any data (or they match only through autocorrection from the unaccented character to the accented one, which is not the desired behaviour). The Dgidx --diacritic-folding configuration doesn't include a mapping for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html).

Is it possible to extend this OOB functionality through a properties file on the Endeca side, or via Nucleus or code?

Answer by radimpe:

The documentation you reference states:

Dgidx supports mapping Latin1, Latin extended-A, and Windows CP1252 international characters to their simple ASCII equivalents during indexing.

This suggests that Greek is not supported, since it doesn't fall into any of these character sets (Greek has its own character set, ISO 8859-7, which is not in that list). That said, you could try setting a language flag at the record level, since you indicate that your data includes both English and Greek, assuming each language has its own record. Alternatively, set a global language using the dgidx and dgraph parameters shown below, but note that this will affect things like stemming for records or properties not in the global language.

dgidx --lang el
dgraph --lang el

Though I'm not sure this will work, given the documentation statement quoted above.

Alternatively, you can implement diacritic removal with a custom accessor that extends the atg.repository.search.indexing.PropertyAccessorImpl class (an option since you refer to Nucleus, so I assume you are using ATG/Oracle Commerce). With this approach you add a normalised searchable field to your index that duplicates each searchable field in your current index, but with all diacritics removed. The same logic the accessor applies must then run as a preprocessor on your search terms, so that the input is normalised to match the indexed values. Lastly, make the original fields (with the accented characters) display-only and the normalised fields searchable but not displayed. A sketch of such an accessor follows.
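Here is a minimal sketch, assuming the standard PropertyAccessor contract; the class name is made up, and the exact getPropertyValue signature may vary between ATG versions, so verify it against PropertyAccessorImpl in your release. The folding itself needs only java.text.Normalizer: NFD decomposition splits each accented letter into its base letter plus a combining mark, which a \p{M} regex then strips. This handles the Greek tonos and dialytika as well as Latin diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

import atg.repository.RepositoryItem;
import atg.repository.search.indexing.Context;
import atg.repository.search.indexing.PropertyAccessorImpl;
import atg.repository.search.indexing.specifier.PropertyTypeEnum;

public class DiacriticFoldingPropertyAccessor extends PropertyAccessorImpl {

  // Combining marks (e.g. Greek tonos and dialytika) exposed by NFD decomposition.
  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

  // Strips accents while keeping the base letters: "καφές" becomes "καφες".
  public static String fold(String pText) {
    if (pText == null) {
      return null;
    }
    String decomposed = Normalizer.normalize(pText, Normalizer.Form.NFD);
    return COMBINING_MARKS.matcher(decomposed).replaceAll("");
  }

  // Delegate to the default accessor, then fold the result. The signature
  // below is an assumption; check PropertyAccessorImpl in your ATG release.
  @Override
  public Object getPropertyValue(Context pContext, RepositoryItem pItem,
                                 String pPropertyName, PropertyTypeEnum pType) {
    Object value = super.getPropertyValue(pContext, pItem, pPropertyName, pType);
    return (value instanceof String) ? fold((String) value) : value;
  }
}

The static fold method can be shared with your query preprocessor so that search terms are normalised in exactly the same way before they are sent to the MDEX engine.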

The result will be matches against your normalised text, but the downside is duplicated data, so your index will be bigger; not a big issue with small data sets. There may also be an impact on how OOTB functionality, like stemming, behaves with the normalised data set. You'll have to test various scenarios in Greek and English to see whether precision and recall are adversely affected.