Regular expression for matching "Shift-JIS" string against given set of ranges


Problem statement:

Let's call the following Shift-JIS code ranges the allowed range: 0x8140~0x84BE, 0x889F~0x9872, 0x989F~0x9FFC, 0xE040~0xEAA4, 0x8740~0x879C, 0xED40~0xEEFC, 0xFA40~0xFC4B, 0xF040~0xF9FC.

I want to validate whether an input String contains a kanji that is not in the above range.

Here are examples of input kanji that are not in the above range, together with the result my code currently prints:

龔 --> OK

鑫 --> OK

璐 --> Need Change

The expected result is "Need Change" for all of them. Please help.

Here is the code:

import java.io.UnsupportedEncodingException;
import java.util.regex.*;
//import java.util.regex.Pattern;

public class RegExpDemo2 {

    private boolean validateMnpName(String name)  {

        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");

            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");

            System.out.println("ShiftJIS Str name : "+strName);

            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";

            if (Pattern.compile(regex).matcher(strName).find()) {
                return true;
            } else
                return false;
        }
        catch (Exception e) {
            e.printStackTrace();
            return false;
        }

    }

    public static void main(String args[]) {

        RegExpDemo2 obj = new RegExpDemo2();

        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }

    }
}


There is 1 answer

Answered by user14644949

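Your approach cannot work, because a String in Java always holds Unicode text (UTF-16 internally), regardless of the encoding its bytes came from.

As observed by @VGR and myself, a round-trip through a Shift-JIS byte array does not change that. You simply converted Unicode to Shift-JIS and back to Unicode, so the regex compares escapes such as \x{8140} against Unicode code points, not against Shift-JIS codes.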
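To make the round-trip point concrete, here is a minimal sketch. The class name RoundTripDemo and the helper roundTrip are made up for illustration, and it assumes the JDK ships the Shift_JIS charset (it is part of the extended charsets in a standard JDK). It encodes the question's test input to Shift-JIS and decodes it again, then shows that the resulting char is still the Unicode code point rather than the Shift-JIS code.

```java
import java.nio.charset.Charset;

public class RoundTripDemo {

    // Assumption: the Shift_JIS charset is available in this JDK.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Encodes to Shift-JIS bytes and decodes straight back,
    // which is what the question's code does before applying the regex.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(SJIS);
        return new String(bytes, SJIS);
    }

    public static void main(String[] args) {
        String ro = "\u30ED"; // katakana RO, the input used in the question's main()
        String back = roundTrip(ro);

        // The decoded String is the same Unicode text as before.
        System.out.println("same string: " + back.equals(ro));
        System.out.printf("char in String: U+%04X%n", (int) back.charAt(0));

        // The Shift-JIS code only exists in the byte array.
        byte[] bytes = ro.getBytes(SJIS);
        System.out.printf("Shift-JIS bytes: %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
    }
}
```

The Shift-JIS value is only visible in the byte array; once the bytes are decoded, the String again contains the original Unicode character, so a pattern written in terms of Shift-JIS codes has nothing to match against.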

There are two approaches possible:

  1. Convert the Java String (which is Unicode) into an array of bytes (in Shift-JIS encoding), and then examine the byte array for the allowed/forbidden values.

  2. Convert the 'allowed' ranges into Unicode (and a single range in Shift-JIS may not be a single range in Unicode) and work with the String representation in Unicode.

Neither way seems pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.
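
The first approach could be sketched roughly as follows. This is an illustration under several assumptions, not a definitive implementation: the class and method names are hypothetical; it uses the windows-31j charset (Microsoft's Shift-JIS variant) on the guess that the NEC/IBM extension ranges in the question refer to it; it treats any single-byte character (ASCII, half-width katakana) as outside the ranges; and it treats a character with no encoding in that charset as outside the ranges too. The range table is copied from the question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class ShiftJisRangeCheck {

    // Allowed double-byte code ranges (inclusive), copied from the question.
    private static final int[][] ALLOWED = {
        {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC}, {0xE040, 0xEAA4},
        {0x8740, 0x879C}, {0xED40, 0xEEFC}, {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    // Assumption: windows-31j is the intended Shift-JIS variant and is available in this JDK.
    private static final Charset SJIS = Charset.forName("windows-31j");

    // Lead bytes of double-byte characters in Shift-JIS.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    static boolean inAllowedRange(int code) {
        for (int[] range : ALLOWED) {
            if (code >= range[0] && code <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Returns true only if every character encodes to a double-byte code inside ALLOWED.
    static boolean allInRange(String name) {
        // REPORT makes unmappable characters throw instead of silently becoming '?'.
        CharsetEncoder encoder = SJIS.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        byte[] bytes;
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(name));
            bytes = new byte[buf.remaining()];
            buf.get(bytes);
        } catch (CharacterCodingException e) {
            return false; // no encoding in this charset: treated as out of range (assumption)
        }
        int i = 0;
        while (i < bytes.length) {
            int b1 = bytes[i] & 0xFF;
            if (!isLeadByte(b1) || i + 1 >= bytes.length) {
                return false; // single-byte character: treated as out of range (assumption)
            }
            int code = (b1 << 8) | (bytes[i + 1] & 0xFF);
            if (!inAllowedRange(code)) {
                return false;
            }
            i += 2;
        }
        return true;
    }

    public static void main(String[] args) {
        // Same test input as the question's main(): katakana RO.
        System.out.println(allInRange("\u30ED") ? "OK" : "Need Change");
    }
}
```

Using a CharsetEncoder with CodingErrorAction.REPORT rather than String.getBytes matters here: getBytes replaces characters that cannot be encoded with a substitute byte, which would hide exactly the characters this check is meant to catch.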