Unicode normalization of homoglyphs to ASCII using Rust


Given a homoglyph, I want a Rust function to convert it to the nearest ASCII character.

All of these Unicode "a"s

A Α А Ꭺ ᗅ ᴀ ꓮ A                   

should be converted to:

a a a a a a a a a a a a a a a a a a a a a a a a a a a a a

I tried this but it didn't work:

let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A                   ";
let normalized = input.nfc().collect::<String>(); // normalize using NFC
let result = normalized.to_lowercase(); // convert to lower case
println!("{}", result);

It output:

a α а ꭺ ᗅ ᴀ ꓮ a


There are 2 answers

Caesar On

I assume you use unicode_normalization::UnicodeNormalization; for .nfc()? (Always nice to mention these things.)

According to the relevant standard annex, that will only do "Canonical Decomposition, followed by Canonical Composition". From what I understand of the jargon, that means it will only change how grapheme clusters are represented by characters, but not how they're supposed to be rendered. What you want is probably the "Compatibility Decomposition", which, as indicated here, includes substitutions like ℌ → H. The Compatibility Decomposition is available through .nfkc() or .nfkd() in the unicode_normalization crate.

kmdreko On

The right tool will depend on the purpose of the transformation, but the Unicode standard does indicate these are "confusable" with "A".

You can try the unicode-security crate and its skeleton() function, which follows the Unicode security mechanisms for Confusable Detection. Using it yields this result:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A                   ";
    let normalized = unicode_security::skeleton(input).collect::<String>();
    let result = normalized.to_lowercase(); // convert to lower case
    println!("{}", result);
}

Output:

a a a a a ᴀ a a a a a a a a a a a a a a a a a a a a a

The only outlier there is "ᴀ": U+1D00 (LATIN LETTER SMALL CAPITAL A). I don't know why it is treated as distinct, but I verified that this is consistent with Unicode's confusables.txt mappings. It is, however, listed as confusable with "ꭺ": U+AB7A (CHEROKEE SMALL LETTER GO).
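Spotting an outlier like this by eye gets harder with longer inputs. Here is a minimal std-only sketch that lists whatever non-ASCII characters remain after a conversion, along with their code points. The helper name non_ascii_leftovers is made up for illustration and is not part of any crate mentioned here, and the string in main is just a stand-in for real converted output.

```rust
/// Hypothetical helper: list every non-ASCII character left in `s`,
/// paired with its code point in U+XXXX notation.
fn non_ascii_leftovers(s: &str) -> Vec<(char, String)> {
    s.chars()
        .filter(|c| !c.is_ascii())
        .map(|c| (c, format!("U+{:04X}", c as u32)))
        .collect()
}

fn main() {
    // Stand-in for the output of whichever conversion is being checked.
    let converted = "a a a a a \u{1D00} a a";
    for (c, code_point) in non_ascii_leftovers(converted) {
        println!("{} {}", c, code_point);
    }
}
```

An empty result means the conversion reached plain ASCII for every character in that input.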


I have found the decancer crate that "removes common confusables from strings" and seems to use an expanded definition of "confusable". Here's how that would look:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A                   ";
    let normalized = decancer::cure(input).into_str();
    println!("{}", normalized);
}

Output:

a a a a a a a a a a a a a a a a a a a a a a a a a a a

Note that it seems to convert to lowercase automatically. So if your idea of "homoglyph" treats "a" and "A" as the same, this may work for you.
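If neither crate fits, for example because only a known handful of look-alikes ever needs handling, the same idea can be sketched by hand with a small lookup. This is only an illustration under that assumption: the table is hand-picked, covers just a few capital-A look-alikes (plus U+1D00, by choice), and is nowhere near the coverage of confusables.txt. The names fold_char and to_ascii_lookalike are hypothetical.

```rust
/// Hypothetical helper: map one character to an ASCII look-alike.
/// The table is hand-picked for illustration and far from complete.
fn fold_char(c: char) -> char {
    match c {
        '\u{0391}'   // GREEK CAPITAL LETTER ALPHA
        | '\u{0410}' // CYRILLIC CAPITAL LETTER A
        | '\u{13AA}' // CHEROKEE LETTER GO
        | '\u{15C5}' // CANADIAN SYLLABICS CARRIER GHO
        | '\u{1D00}' // LATIN LETTER SMALL CAPITAL A (added by choice)
        | '\u{A4EE}' // LISU LETTER A
        | '\u{FF21}' // FULLWIDTH LATIN CAPITAL LETTER A
        => 'A',
        // Anything the table does not know is passed through unchanged.
        other => other,
    }
}

/// Fold every character through the table, then lowercase the result.
fn to_ascii_lookalike(input: &str) -> String {
    input.chars().map(fold_char).collect::<String>().to_lowercase()
}

fn main() {
    let input = "A \u{0391} \u{0410} \u{13AA} \u{15C5} \u{1D00} \u{A4EE} \u{FF21}";
    println!("{}", to_ascii_lookalike(input));
}
```

Characters missing from the table pass through untouched, so a hand-written table like this is only reasonable when the set of expected inputs is small and known in advance.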