Unicode Japanese prolonged sound mark excluded from Kana script?


I'm trying to clean up a string by removing special characters to make a slug. That said, I want to keep CJK characters, otherwise there's nothing left for these languages.

So I have a regex that is supposed to keep CJK characters by listing the scripts:

"[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]"

The problem is, the katakana prolonged sound mark "ー" seems to be excluded.

http://www.unicodemap.org/details/0x30FC/index.html

Here is the code showing the problem:

https://github.com/erwan/unicode-java-issue/blob/master/src/main/java/com/example/Hello.java

Is it not in the scripts I listed?

edit: ok, code here if you prefer, but it doesn't give much more information than the regex itself. It's mostly useful so people can try it.

package com.example;

class Hello {
    public static void main(String[] args) {
        String input = "%;アレルギー[]abcd";
        String output = input.replaceAll("[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]", "");
        System.out.println(output);
    }
}

There are 2 answers

RealSkeptic On BEST ANSWER

No, as a matter of fact, it is not in the scripts listed. The Unicode Standard places this character in the Common script.

One should distinguish between "script" and "block" in Unicode. That character belongs to the Katakana block, which also contains a few other characters that are not syllables, such as the "Katakana iteration mark" (\u30fd). But it does not belong to the Katakana script: the prolonged sound mark is used with both Hiragana and Katakana, so it is assigned to the Common script rather than to either one.
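If you want to check the script/block distinction yourself, here is a minimal sketch using the JDK's own `Character.UnicodeScript.of` and `Character.UnicodeBlock.of`. The class and helper names (`ScriptVsBlock`, `scriptOf`, `blockOf`) are made up for illustration; the code points are written as hex so the source stays ASCII-only.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;

class ScriptVsBlock {
    // Script the JDK's Unicode data assigns to this code point.
    static UnicodeScript scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint);
    }

    // Block (contiguous code point range) that contains this code point.
    static UnicodeBlock blockOf(int codePoint) {
        return UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // 0x30A2 = KATAKANA LETTER A, 0x30FC = KATAKANA-HIRAGANA PROLONGED SOUND MARK
        int[] samples = {0x30A2, 0x30FC};
        for (int cp : samples) {
            System.out.printf("U+%04X script=%s block=%s%n", cp, scriptOf(cp), blockOf(cp));
        }
    }
}
```

Both code points should report the Katakana block, but only U+30A2 reports the Katakana script; U+30FC reports Common, which is why `\p{script=Kana}` does not cover it.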

One thing you can do is replace the script specification with a block specification for Katakana:

output = input.replaceAll("[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{block=Katakana}\\p{script=Hang}]", "");

The output in this case would include the prolonged sound mark.

Or you can do it like this:

// uses java.util.regex.Matcher and java.util.regex.Pattern
Matcher m = Pattern.compile("[^-_.\\w]", Pattern.UNICODE_CHARACTER_CLASS).matcher(input);
output = m.replaceAll("");

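With the UNICODE_CHARACTER_CLASS flag, \w covers word characters in all languages, including, but not limited to, Japanese, so this pattern removes everything that is not a word character, "-", "_" or ".".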

For the input string "%;アレルギー[]{}=abceⸯd漢字ру́сский", this will yield

アレルギーabceⸯd漢字ру́сский

Whereas with my first suggestion, the one with the block, the output will be:

アレルギーabced漢字

So if you just want to limit the result to Japanese (and Chinese and Korean), my first suggestion might suit you better, whereas if you want all international word characters, the second will be better.
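To try the two variants side by side, something along these lines should work. It is only a sketch: the class and method names (`SlugCompare`, `keepCjk`, `keepAllWords`) are invented here, the patterns are the two from this answer, and the test strings use `\u` escapes so the source file stays ASCII-only.

```java
import java.util.regex.Pattern;

class SlugCompare {
    // Variant 1: ASCII \w plus Han, Hiragana, Hangul, and the whole Katakana block.
    static final Pattern CJK_ONLY = Pattern.compile(
        "[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{block=Katakana}\\p{script=Hang}]");

    // Variant 2: \w is Unicode-aware, so word characters of every script are kept.
    static final Pattern ALL_WORDS = Pattern.compile(
        "[^-_.\\w]", Pattern.UNICODE_CHARACTER_CLASS);

    static String keepCjk(String input) {
        return CJK_ONLY.matcher(input).replaceAll("");
    }

    static String keepAllWords(String input) {
        return ALL_WORDS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Katakana word ending in U+30FC, wrapped in punctuation.
        String japanese = "%;\u30A2\u30EC\u30EB\u30AE\u30FC[]abcd";
        // Two Han characters followed by two Cyrillic letters.
        String mixed = "\u6F22\u5B57\u0440\u0443";

        System.out.println(keepCjk(japanese));      // punctuation removed, U+30FC kept
        System.out.println(keepAllWords(japanese)); // same result for this input
        System.out.println(keepCjk(mixed));         // Cyrillic letters are dropped
        System.out.println(keepAllWords(mixed));    // Cyrillic letters are kept
    }
}
```

Compiling the patterns once as constants is a design choice of the sketch, not something the original code requires; `String.replaceAll` recompiles the regex on every call.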

Wiktor Stribiżew On

To keep that character from being matched (and therefore removed), add it to the negated class:

"[^-_ー.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]"
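A small sketch of this applied to the question's input, assuming you only need this one extra character. The class and method names (`KeepProlongedMark`, `slugChars`) are hypothetical, and the mark is written as `\u30FC` so the source file stays ASCII-only; the Java compiler turns that escape into the literal character before the regex is parsed.

```java
class KeepProlongedMark {
    // Same pattern as in the question, with U+30FC listed explicitly
    // among the characters that must not be removed.
    static String slugChars(String input) {
        return input.replaceAll(
            "[^-_\u30FC.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]",
            "");
    }

    public static void main(String[] args) {
        // Katakana word ending in the prolonged sound mark, wrapped in punctuation.
        String input = "%;\u30A2\u30EC\u30EB\u30AE\u30FC[]abcd";
        System.out.println(slugChars(input)); // the trailing U+30FC survives
    }
}
```

Other Common-script characters in the Katakana block would still be stripped with this pattern, so each one you want to keep has to be listed the same way.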