How to decide which Encoder to use for which language in Elasticsearch "Phonetic Token filter"?

Question

1.5k views Asked by Abhinav Keshri At 28 March 2020 at 05:24

I have used the Metaphone and Soundex encoders with the phonetic token filter in Elasticsearch.

Metaphone works well for English words.

Soundex works well for English and for Hindi, and possibly for other languages too.

I want to know which of these encoders is best suited to Hindi and, if possible, to other Indian languages:

Soundex
Metaphone
double_metaphone
refined_soundex
caverphone1 - English (New Zealand localised)
caverphone2 - English (New Zealand localised)
cologne - German
nysiis - improved Soundex
koelnerphonetik - German
haasephonetik - German
beider_morse - English and multiple European languages
daitch_mokotoff - Slavic & Yiddish surnames

The Elasticsearch website does not say which encoder to choose for which language.

Also, please tell me which of these encoders you have already used, and for which languages.

There is 1 answer

**jaspreet chahal** · Accepted Answer · 2020-03-28T05:49:34+00:00

Phonetic encoders are algorithms for indexing words by their pronunciation.

Explanations of the main ones are available on Wikipedia:

Metaphone, Double Metaphone, and Metaphone 3: suitable for use with most English words, not just names. Metaphone algorithms are the basis for many popular spell checkers. Double Metaphone is the second generation of this algorithm.

Soundex: developed to encode surnames for use in censuses. Soundex codes are four-character strings composed of a single letter followed by three digits.
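To make the "one letter plus three digits" format concrete, here is a minimal Python sketch of classic American Soundex. It is an illustration only: it is not the code the Elasticsearch plugin runs, and the function name `soundex` and the table `SOUNDEX_CODES` are my own. It also assumes input in ASCII letters, so Hindi text would have to be transliterated to Latin script first.

```python
# Consonant groups of classic American Soundex (illustrative sketch).
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no ASCII letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                      # the first letter is kept as-is
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit:
            if digit != prev:              # adjacent equal codes collapse
                code += digit
            prev = digit
        elif c not in "HW":
            prev = ""                      # a vowel separates repeated codes
        # H and W are skipped without resetting the previous code
        if len(code) == 4:
            break
    return code.ljust(4, "0")              # pad short codes with zeros


print(soundex("Robert"), soundex("Rupert"))  # both print R163
```

Because the sketch drops every non-ASCII character, a word written in Devanagari yields an empty code, which is one reason plain Soundex needs transliteration or an Indic-specific variant for Indian-language text.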

Daitch–Mokotoff Soundex: a refinement of Soundex designed to better match surnames of Slavic and Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six numeric digits.

Cologne phonetics: similar to Soundex, but better suited to German words.

New York State Identification and Intelligence System (NYSIIS): maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding.

Match Rating Approach: developed by Western Airlines in 1977; it has both an encoding technique and a range comparison technique.

Caverphone: created to assist in data matching between late 19th century and early 20th century electoral rolls, optimized for accents present in parts of New Zealand.

References: details of the above algorithms and their variants are available on this Wikipedia page:

1. https://en.wikipedia.org/wiki/Phonetic_algorithm

Among the above, Soundex is the most suitable for Indian languages. You can check the resources below:

1. Phonetic search for Indian languages
2. https://thottingal.in/blog/2009/07/26/indicsoundex/
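If you would rather compare encoders on your own data than rely on a recommendation, one option is to define one analyzer per encoder and run the same words through each. Below is a rough Python sketch that only builds the index-settings body; it assumes the `analysis-phonetic` plugin is installed, and the helper name `phonetic_index_settings` and the `*_filter` / `*_analyzer` names are made up for illustration.

```python
import json


def phonetic_index_settings(encoders):
    """Build an index-settings body with one phonetic filter and analyzer per encoder."""
    filters = {}
    analyzers = {}
    for enc in encoders:
        filter_name = f"{enc}_filter"
        filters[filter_name] = {
            "type": "phonetic",
            "encoder": enc,
            "replace": False,  # keep the original token alongside its phonetic code
        }
        analyzers[f"{enc}_analyzer"] = {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", filter_name],
        }
    return {"settings": {"analysis": {"filter": filters, "analyzer": analyzers}}}


# Encoder names are taken from the list in the question.
body = phonetic_index_settings(["soundex", "metaphone", "double_metaphone"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body when creating a test index, and each analyzer can then be tried with the `_analyze` API on sample Hindi names (transliterated, if the encoder only handles Latin letters) to see which one groups your variants best.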
