Regular expression for matching "Shift-JIS" string against given set of ranges

Question

Asked by Mahesh Jadhav on 26 November 2020 at 14:36 · 934 views

Problem statement:

Let's call 0x8140-0x84BE, 0x889F-0x9872, 0x989F-0x9FFC, 0xE040-0xEAA4, 0x8740-0x879C, 0xED40-0xEEFC, 0xFA40-0xFC4B and 0xF040-0xF9FC the allowed ranges.

I want to check whether an input String contains a kanji that is not in the above ranges.

Here are examples of input kanji characters that are not in the above ranges, with the results my code currently prints:

龔 --> OK

鑫 --> OK

璐 --> Need Change

The expected result is "Need Change" for all of them. Please help.

Here is the code:

import java.io.UnsupportedEncodingException;
import java.util.regex.*;
//import java.util.regex.Pattern;

public class RegExpDemo2 {

    private boolean validateMnpName(String name)  {

        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");

            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");

            System.out.println("ShiftJIS Str name : "+strName);

            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";

            if (Pattern.compile(regex).matcher(strName).find()) {
                return true;
            } else
                return false;
        }
        catch (Exception e) {
            e.printStackTrace();
            return false;
        }

    }

    public static void main(String args[]) {

        RegExpDemo2 obj = new RegExpDemo2();

        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }

    }
}

Original Q&A

There is 1 answer

**user14644949** · Answer 1 · 2020-11-27T13:42:47+00:00

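Your approach cannot work, because a String in Java is always Unicode.

As observed by @VGR and myself, a round-trip through a Shift-JIS byte array does not change that. You simply converted Unicode to Shift-JIS and back to Unicode, so the regex compares your Shift-JIS code values against Unicode code points, and any match is a coincidence.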

There are two approaches possible:

1. Convert the Java String (which is Unicode) into an array of bytes (in Shift-JIS encoding), and then examine the byte array for the allowed/forbidden values.
2. Convert the 'allowed' ranges into Unicode (and a single range in Shift-JIS may not be a single range in Unicode) and work with the String representation in Unicode.
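
The first approach could be sketched roughly as follows. This is only an illustration under some assumptions, not a definitive implementation: the class and method names (ShiftJisRangeCheck, isValidName, inAllowedRange) are made up for this example, the charset is assumed to be "windows-31j" because ranges such as 0xED40-0xEEFC and 0xFA40-0xFC4B appear to be vendor extension rows that Java's plain "Shift_JIS" charset does not encode, and single-byte characters (ASCII, half-width katakana) are assumed to be outside the scope of the check.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class ShiftJisRangeCheck {

    // Assumption: the ranges from the question are Windows-31J (MS932) code values.
    private static final Charset SJIS = Charset.forName("windows-31j");

    // The allowed double-byte ranges, copied from the question.
    private static final int[][] ALLOWED = {
        {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC}, {0xE040, 0xEAA4},
        {0x8740, 0x879C}, {0xED40, 0xEEFC}, {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    static boolean inAllowedRange(int code) {
        for (int[] range : ALLOWED) {
            if (code >= range[0] && code <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Returns false as soon as one character has no Shift-JIS encoding
    // or encodes to a double-byte value outside the allowed ranges.
    static boolean isValidName(String name) {
        // A new encoder reports unmappable characters instead of writing '?'.
        CharsetEncoder encoder = SJIS.newEncoder();
        for (int i = 0; i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            i += Character.charCount(codePoint);
            ByteBuffer bytes;
            try {
                bytes = encoder.encode(CharBuffer.wrap(Character.toChars(codePoint)));
            } catch (CharacterCodingException e) {
                return false; // no Shift-JIS representation at all
            }
            if (bytes.remaining() == 2) {
                // Combine lead and trail byte into one 16-bit code value.
                int code = ((bytes.get(0) & 0xFF) << 8) | (bytes.get(1) & 0xFF);
                if (!inAllowedRange(code)) {
                    return false;
                }
            }
            // Assumption: single-byte characters are ignored by this check.
        }
        return true;
    }

    public static void main(String[] args) {
        // U+3042 (hiragana A) encodes to 0x82A0, U+4E9C encodes to 0x889F.
        System.out.println(isValidName("\u3042\u4E9C") ? "OK" : "Need Change");
        // U+D55C is a Hangul syllable with no Shift-JIS encoding.
        System.out.println(isValidName("\uD55C") ? "OK" : "Need Change");
    }
}
```

The important difference from the code in the question is that the comparison happens on the encoded bytes, and that an encoder is used directly so that unmappable characters are reported rather than silently replaced with '?'.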

Neither way seems pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.
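
For the second approach, one possible sketch is to decode every code value in the allowed ranges once and collect the resulting Unicode code points in a set. The names here (AllowedCodePoints, build, isValidName) are hypothetical, and "windows-31j" is again an assumption about which Shift-JIS variant the ranges refer to. Unlike a byte-level check, this version rejects every character that is not in the set, including ASCII, so single-byte characters would need separate handling if they are acceptable.

```java
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

class AllowedCodePoints {

    // Assumption: the ranges from the question are Windows-31J (MS932) code values.
    private static final Charset SJIS = Charset.forName("windows-31j");

    private static final int[][] RANGES = {
        {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC}, {0xE040, 0xEAA4},
        {0x8740, 0x879C}, {0xED40, 0xEEFC}, {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    // Decodes each two-byte value in the ranges and keeps the code points
    // that the charset actually assigns.
    static Set<Integer> build() {
        Set<Integer> allowed = new HashSet<>();
        for (int[] range : RANGES) {
            for (int code = range[0]; code <= range[1]; code++) {
                int trail = code & 0xFF;
                // Skip values whose second byte is not a valid trail byte.
                if (trail < 0x40 || trail == 0x7F || trail > 0xFC) {
                    continue;
                }
                byte[] bytes = {(byte) (code >> 8), (byte) trail};
                String decoded = new String(bytes, SJIS);
                // Unassigned values decode to U+FFFD (possibly with extra chars).
                if (decoded.codePointCount(0, decoded.length()) == 1
                        && decoded.codePointAt(0) != 0xFFFD) {
                    allowed.add(decoded.codePointAt(0));
                }
            }
        }
        return allowed;
    }

    // True only if every code point of the name is in the allowed set.
    static boolean isValidName(String name, Set<Integer> allowed) {
        return name.codePoints().allMatch(cp -> allowed.contains(cp));
    }

    public static void main(String[] args) {
        Set<Integer> allowed = build();
        // U+3042 (hiragana A) and U+4E9C come from 0x82A0 and 0x889F.
        System.out.println(isValidName("\u3042\u4E9C", allowed) ? "OK" : "Need Change");
        // U+D55C is a Hangul syllable and is not produced by any allowed value.
        System.out.println(isValidName("\uD55C", allowed) ? "OK" : "Need Change");
    }
}
```

Building the set once and reusing it keeps the per-name check cheap. A set is used rather than a regex character class because, as noted above, one Shift-JIS range generally does not map to one contiguous Unicode range.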
