I want to measure how phonetically similar two non-English strings are. AFAIK, the soundex and metaphone implementations only work correctly for English-based strings. For instance, `coração` and `corassão` sound exactly the same in Portuguese, but `metaphone()` returns `KR` and `KRS`. The same thing happens with other phonemes: `chita` and `xita` return `XT` and `ST`, but they sound the same.
I've also tried this Double Metaphone implementation (demo) but the results are exactly the same.
So, is there any alternative algorithm that works with Portuguese words? I've read about Lucene in this other question, but I've never used it before and I'm not sure how it works or how to use it.
If not, does anyone know what kind of data I need to gather to develop a metaphone-like algorithm?
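To make the goal concrete, here is a rough sketch of the kind of key I'm after. My assumption is that a metaphone-like algorithm is essentially an ordered table of spelling-to-sound rewrite rules followed by dropping vowels. The `RULES` table and the `pt_key` name are hypothetical and cover only the two examples above; real Portuguese needs context-sensitive rules (for example, `x` is not always pronounced like `ch`).

```python
import unicodedata

# Hypothetical rewrite rules (spelling -> stand-in letter for the sound).
# These are assumptions that cover only the examples in this question.
RULES = [
    ("ss", "s"),  # "ss" and "ç" are both treated as an /s/ sound
    ("ç", "s"),
    ("ch", "x"),  # "ch" and word-initial "x" are both treated as the "sh" sound
]


def strip_accents(text):
    """Remove diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pt_key(word):
    """Build a crude phonetic key: apply rules, strip accents, drop later vowels."""
    # Normalize to NFC first so "ç" is a single character the rules can match.
    text = unicodedata.normalize("NFC", word.lower())
    for spelling, sound in RULES:
        text = text.replace(spelling, sound)
    text = strip_accents(text)
    # Keep the first letter and drop the remaining vowels.
    head, tail = text[:1], text[1:]
    key = head + "".join(ch for ch in tail if ch not in "aeiou")
    return key.upper()


print(pt_key("coração"), pt_key("corassão"))
print(pt_key("chita"), pt_key("xita"))
```

With these rules, both spellings in each pair collapse to the same key, which is the behaviour I'd want from a Portuguese-aware algorithm.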
In case anyone is interested, I found a promising work-in-progress here and some other cool projects.