uconv - Does the -x option define a transliterator or a transform?


The man pages for uconv say:

-x transliteration
Run the given transliteration on the transcoded Unicode data, and use the transliterated data as input for the transcoding to the destination encoding.

It also includes the following two examples:

echo '\u30ab' | uconv -x 'hex-any; any-name'

uconv -f utf-8 -t utf-8 -x '::nfkc; [:Cc:] >; ::katakana-hiragana;'

The first example suggests that the -x option takes a "compound transform" (a list of transform IDs), but the second example suggests that it takes the rules of a "rule-based transliterator".

This is exacerbated by the fact that many of ICU's provided examples don't work:

$ echo "Example" | uconv -f UTF8 -t UTF8 -x 'NFD; [:Nonspacing Mark:] Remove; NFC;'
Couldn't create transliteration "NFD; [:Nonspacing Mark:] Remove; NFC;": U_MISSING_OPERATOR, line 0, offset 0.

$ echo "Example" | uconv -f UTF8 -t UTF8 -x '[:Latin:]; NFKD; Lower; Latin-Katakana;'
Couldn't create transliteration "[:Latin:]; NFKD; Lower; Latin-Katakana;": U_MISSING_OPERATOR, line 0, offset 0.

But some examples work just fine:

$ echo "Example" | uconv -f UTF8 -t UTF8 -x '[aeiou] Upper'
ExAmplE

$ echo "Example" | uconv -f UTF8 -t UTF8 -x 'NFKD; Lower; Latin-Katakana;'
エクサンプレ

So what the heck does -x define?


The plot thickens! It looks like uconv chokes on predefined character classes that aren't in a transform rule.

Regular character classes:

$ echo "Example" | uconv -f UTF8 -t UTF8 -x '[a-zA-Z] Upper'
EXAMPLE

$ echo "Example" | uconv -f UTF8 -t UTF8 -x ':: [a-zA-Z] Upper;'
EXAMPLE

Predefined character classes:

$ echo "Example" | uconv -f UTF8 -t UTF8 -x '[:alpha:] Upper'
Couldn't create transliteration "[:alpha:] Upper": U_MISSING_OPERATOR, line 0, offset 0.

$ echo "Example" | uconv -f UTF8 -t UTF8 -x ':: [:alpha:] Upper;'
EXAMPLE

Just in case, here's the version of uconv I'm using:

$ uconv --version
uconv v2.1  ICU 58.1

1 answer

Answer by Grisha Levit

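It does different things depending on what you pass: if the argument contains a ':', '<' or '>' character, uconv parses it as a set of rules; otherwise it treats it as a transliterator ID.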

The excerpt below is formatted code from uconv.cpp. translit is the value of the -x argument.

UnicodeString str(translit), pestr;

/* Create from rules or by ID as needed. */

parse.line = -1;

if (uprv_strchr(translit, ':') || uprv_strchr(translit, '>') ||
    uprv_strchr(translit, '<') || uprv_strchr(translit, '>')) {
  t = Transliterator::createFromRules(UNICODE_STRING_SIMPLE("Uconv"), str,
                                      UTRANS_FORWARD, parse, err);
} else {
  t = Transliterator::createInstance(UnicodeString(translit, -1, US_INV),
                                     UTRANS_FORWARD, err);
}
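To see how this check accounts for the results in the question, here is a minimal shell sketch that mimics only the character test from the excerpt. The function name uconv_mode is hypothetical (it is not part of uconv or ICU), and the sketch does not call uconv at all; it just reports which branch a given -x argument would take.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the strchr test in the uconv.cpp excerpt.
# Prints "rules" if the argument would go to createFromRules,
# or "id" if it would go to createInstance.
uconv_mode() {
  case "$1" in
    *:*|*'<'*|*'>'*) echo rules ;;
    *)               echo id ;;
  esac
}

# Classify the -x arguments used in the question.
for arg in \
  'hex-any; any-name' \
  '[aeiou] Upper' \
  'NFKD; Lower; Latin-Katakana;' \
  '[:alpha:] Upper' \
  ':: [:alpha:] Upper;' \
  'NFD; [:Nonspacing Mark:] Remove; NFC;'
do
  printf '%s -> %s\n' "$arg" "$(uconv_mode "$arg")"
done
```

Every argument that failed with U_MISSING_OPERATOR contains a ':' (from a predefined class such as [:alpha:]), so it is sent down the rules branch even though it is written as an ID. There, a bare "[:alpha:] Upper" is not a valid rule, which would explain the error.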

And createFromRules in turn creates different kinds of objects depending on the input. From its API documentation:

Returns a Transliterator object constructed from the given rule string. This will be a RuleBasedTransliterator, if the rule string contains only rules, or a CompoundTransliterator, if it contains ID blocks, or a NullTransliterator, if it contains ID blocks which parse as empty for the given direction.
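Given that, one possible workaround for IDs that contain predefined classes is to write each ID as an ID block (":: ID;"), which is the form that worked in the question for ':: [:alpha:] Upper;'. The sketch below is plain string manipulation; the function name to_rules is hypothetical, and it assumes that no ';' appears inside a set such as [;]. Only the single-ID case is confirmed by the question's output. The multi-ID output follows the same pattern but is untested here.

```shell
#!/bin/sh
# Hypothetical helper: rewrite an ID list such as "A; B; C;" as ID blocks
# ":: A; :: B; :: C;". Assumes ';' is used only as a separator.
to_rules() {
  printf '%s\n' "$1" |
    tr ';' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^$/d' -e 's/.*/:: &;/' |
    paste -sd ' ' -
}

to_rules '[:alpha:] Upper'
# prints: :: [:alpha:] Upper;

to_rules 'NFD; [:Nonspacing Mark:] Remove; NFC;'
# prints: :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;
```

The result could then be passed along as -x "$(to_rules '[:alpha:] Upper')" in place of the original argument.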