Problem statement:
Let's call the following "the range": 0x8140~0x84BE, 0x889F~0x9872, 0x989F~0x9FFC, 0xE040~0xEAA4, 0x8740~0x879C, 0xED40~0xEEFC, 0xFA40~0xFC4B, 0xF040~0xF9FC.
I want to check whether an input String contains a kanji that is not in the above range.
Here are examples of input kanji characters that are not in the above range, with the results I get:
龔 --> OK
鑫 --> OK
璐 --> Need Change
The expected result is "Need Change" for all of them. Please help.
Here is the code:
import java.io.UnsupportedEncodingException;
import java.util.regex.*;
//import java.util.regex.Pattern;

public class RegExpDemo2 {
    private boolean validateMnpName(String name) {
        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");
            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");
            System.out.println("ShiftJIS Str name : "+strName);
            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";
            if (Pattern.compile(regex).matcher(strName).find()) {
                return true;
            } else
                return false;
        }
        catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String args[]) {
        RegExpDemo2 obj = new RegExpDemo2();
        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }
    }
}
Your approach cannot work, because a String is Unicode in Java.
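As observed by @VGR and myself, a round-trip through a Shift-JIS byte array does not change that. You simply converted Unicode to Shift-JIS and back to Unicode, so the regex ends up comparing Unicode code points against numbers that were meant as Shift-JIS codes.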
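To make the round-trip point concrete, here is a minimal sketch. It assumes the JDK in use ships the Shift_JIS charset; the class name RoundTripDemo and the helper roundTrip are made up for this illustration.

```java
import java.nio.charset.Charset;

class RoundTripDemo {
    // Encodes to Shift_JIS bytes and decodes straight back, as the question's code does.
    static String roundTrip(String s) {
        Charset sjis = Charset.forName("Shift_JIS"); // assumption: charset is available
        return new String(s.getBytes(sjis), sjis);
    }

    public static void main(String[] args) {
        String original = "\u65E5"; // the kanji 日
        String back = roundTrip(original);
        // The result is the same Unicode string as before the round trip.
        System.out.println(back.equals(original));
        // A regex applied to 'back' sees this Unicode code point, not a Shift-JIS code.
        System.out.printf("U+%04X%n", (int) back.charAt(0));
    }
}
```

The Shift-JIS code for this character is 0x93FA, but the String only ever holds U+65E5, which is what a pattern such as \x{889F}-\x{9872} is compared against.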
There are two possible approaches:
1. Convert the Java String (which is Unicode) into an array of bytes (in Shift-JIS encoding), and then examine the byte array for the allowed/forbidden values.
2. Convert the 'allowed' ranges into Unicode (note that a single range in Shift-JIS may not map to a single range in Unicode) and work with the String representation in Unicode.
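The first approach could be sketched roughly as follows. This is only an illustration under several assumptions: the class and method names are hypothetical, every character is required to be a two-byte code inside the ranges (so ASCII and other single-byte characters count as out of range), and main uses the windows-31j charset because ranges such as 0xED40~0xEEFC look like vendor extensions rather than plain Shift_JIS. Swap in whichever charset your data really uses.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class SjisRangeCheck {
    // The ranges from the question, as inclusive {low, high} pairs of two-byte codes.
    private static final int[][] ALLOWED = {
        {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC}, {0xE040, 0xEAA4},
        {0x8740, 0x879C}, {0xED40, 0xEEFC}, {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    // Lead bytes of two-byte Shift-JIS characters.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    static boolean inAllowedRange(int code) {
        for (int[] r : ALLOWED) {
            if (code >= r[0] && code <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Returns true only if every character encodes to a two-byte code inside the ranges.
    static boolean allInRange(String name, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes;
        try {
            bytes = encoder.encode(CharBuffer.wrap(name));
        } catch (CharacterCodingException e) {
            // The character does not exist in the target charset at all.
            return false;
        }
        while (bytes.hasRemaining()) {
            int first = bytes.get() & 0xFF;
            if (!isLeadByte(first) || !bytes.hasRemaining()) {
                return false; // single-byte character (assumed not allowed here)
            }
            int code = (first << 8) | (bytes.get() & 0xFF);
            if (!inAllowedRange(code)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset charset = Charset.forName("windows-31j"); // assumption, see above
        for (String s : new String[] {"\u65E5\u672C", "A", "\u7490"}) {
            System.out.println(s + " --> " + (allInRange(s, charset) ? "OK" : "Need Change"));
        }
    }
}
```

Using an encoder that reports unmappable characters matters here: String.getBytes silently substitutes a replacement byte for characters the charset cannot represent, which hides exactly the cases this check is meant to catch.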
Neither way seems pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.
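For the second approach, one way to avoid translating the ranges by hand is to decode every two-byte code in them once and collect the resulting Unicode code points in a set. The sketch below is an assumption-laden illustration, not a definitive implementation: the names UnicodeAllowedSet, build and allInRange are invented, codes that the charset does not assign are simply skipped, and windows-31j is again a guess at the intended charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.BitSet;

class UnicodeAllowedSet {
    // The ranges from the question, as inclusive {low, high} pairs of two-byte codes.
    private static final int[][] RANGES = {
        {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC}, {0xE040, 0xEAA4},
        {0x8740, 0x879C}, {0xED40, 0xEEFC}, {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    // Decodes each two-byte code in the ranges and records its Unicode code point.
    static BitSet build(Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        BitSet allowed = new BitSet();
        for (int[] r : RANGES) {
            for (int code = r[0]; code <= r[1]; code++) {
                byte[] pair = {(byte) (code >> 8), (byte) code};
                try {
                    String s = decoder.decode(ByteBuffer.wrap(pair)).toString();
                    if (s.codePointCount(0, s.length()) == 1) {
                        allowed.set(s.codePointAt(0));
                    }
                } catch (CharacterCodingException e) {
                    // Unassigned or invalid code in this charset: nothing to allow.
                }
            }
        }
        return allowed;
    }

    // Returns true only if every code point of the name is in the allowed set.
    static boolean allInRange(String name, BitSet allowed) {
        return name.codePoints().allMatch(allowed::get);
    }

    public static void main(String[] args) {
        BitSet allowed = build(Charset.forName("windows-31j")); // assumption
        System.out.println(allInRange("\u65E5\u672C", allowed)); // 日本
        System.out.println(allInRange("A", allowed));
    }
}
```

The set only needs to be built once, after which validation is a plain Unicode lookup and no longer touches Shift-JIS bytes.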