I'm trying to clean up a string by removing special characters to make a slug. That said, I want to keep CJK characters; otherwise there's nothing left for these languages.
So I have a regex that is supposed to keep CJK characters by listing the scripts:
"[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]"
The problem is, the katakana prolonged sound mark "ー" seems to be excluded.
http://www.unicodemap.org/details/0x30FC/index.html
Here is the code showing the problem:
https://github.com/erwan/unicode-java-issue/blob/master/src/main/java/com/example/Hello.java
Is it not in the scripts I listed?
edit: ok, code here if you prefer, but it doesn't give much more information than the regex itself. It's mostly useful so people can try it.
    package com.example;

    class Hello {
        public static void main(String[] args) {
            String input = "%;アレルギー[]abcd";
            String output = input.replaceAll("[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{script=Kana}\\p{script=Hang}]", "");
            System.out.println(output);
        }
    }
No, as a matter of fact, it is not in the scripts listed. The Unicode Standard places this character in the Common script.

One should differentiate between "script" and "block" in Unicode. That character belongs to the Katakana block, along with a few other characters which are not syllables, such as the "Katakana middle dot" (\u30fb). But it does not belong to the Katakana script. Essentially only the actual syllables belong in the Katakana script.

One thing you can do is change the script indication to block for Katakana, that is, use \p{block=Katakana} in place of \p{script=Kana}. The output in this case would include the prolonged sound mark.
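To see the script/block difference and the effect of the change, here is a minimal sketch. The class name KatakanaBlockSlug and the helper clean are made up for illustration; the character class is the one from the question with only the Katakana part switched from script to block.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;

// Hypothetical class name, used only for this illustration.
class KatakanaBlockSlug {
    // Same character class as in the question, except that Katakana is
    // matched by block (\p{block=Katakana}) instead of by script.
    static final String NOT_ALLOWED =
        "[^-_.\\w-\\p{script=Han}\\p{script=Hira}\\p{block=Katakana}\\p{script=Hang}]";

    // Removes every character that the class above does not allow.
    static String clean(String input) {
        return input.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        int mark = 0x30FC; // the prolonged sound mark from the question
        System.out.println(UnicodeScript.of(mark)); // COMMON
        System.out.println(UnicodeBlock.of(mark));  // KATAKANA
        System.out.println(clean("%;アレルギー[]abcd")); // アレルギーabcd
    }
}
```

Note that \w is still ASCII-only here, so letters from scripts that are not listed (Cyrillic, for instance) are removed.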
Or you can do it like this:
This pattern will match all word characters, in all languages, including, but not limited to, Japanese.
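One way to get this kind of behavior in Java 7 or later, shown here as a sketch and not necessarily the exact pattern meant above, is to compile the character class with Pattern.UNICODE_CHARACTER_CLASS, so that \w follows the Unicode definition of a word character instead of the ASCII one. The class name UnicodeWordSlug and the helper clean are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class UnicodeWordSlug {
    // Assumption: keep '-', '_', '.' and anything \w matches once
    // UNICODE_CHARACTER_CLASS is enabled (letters, marks, digits, ...).
    private static final Pattern NOT_ALLOWED =
        Pattern.compile("[^-_.\\w]", Pattern.UNICODE_CHARACTER_CLASS);

    // Removes every character that the class above does not allow.
    static String clean(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("%;アレルギー[]abcd")); // アレルギーabcd
    }
}
```

The same flag can also be switched on inline by prefixing the pattern with (?U), which is handy when the pattern is passed to String.replaceAll.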
For the input string "%;アレルギー[]{}=abceⸯd漢字ру́сский", this will yield:

Whereas with my first suggestion, the one with the block, the output will be:
So if you just want to limit it to Japanese (and Korean), my first suggestion might suit you better, whereas if you want all international word characters, the second will be better.