Convert Japanese Kanji characters from Shift-JIS to UTF-8

I'm trying to read a CSV file containing Japanese text and write some of its data to a DB. The CSV is uploaded through some Flex code that I'm not very comfortable with, but on the backend I simply have a byte[] with the content of the file. I'm using the following code:

// content is the array of bytes returned by the Flex side
ByteArrayInputStream in = new ByteArrayInputStream(content);
BufferedReader br = new BufferedReader(new InputStreamReader(in, Const.ENCODING_SHIFT_JIS));
String strLine;
try {
    while (true) {
        strLine = br.readLine();
        // process the CSV line by line and eventually write the data to the DB
...

When I inspect the strLine variable in the debugger, I see only question marks in place of the Kanji characters (in particular, I tested with the Kanji character 裵). Other Japanese characters seem to be fine (for example, the 〒 character). In the debug window (and later in my DB) the text appears like this: 〒���

If I do the same thing with a UTF-8 encoded file and Const.UTF-8 instead of Const.ENCODING_SHIFT_JIS in my code, everything works fine. But the client needs Shift-JIS support. Can someone tell me how to solve this issue, or at least which area (Flex, Java, the Shift-JIS encoding itself, ...) the problem may be in?

There is 1 answer

kumade On BEST ANSWER

After some research and trial and error, I noticed that if I specify "JISAutoDetect" instead of "Shift-JIS" as the charset parameter for InputStreamReader, all the Kanji characters become readable.

According to a description I found here, JISAutoDetect does the following: "Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only)". So it is doing its job well.
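A minimal sketch of that change, assuming the uploaded file is available as a byte[] as in the question. The class and method names (CsvLineReader, readLines) are hypothetical, and the sample bytes in main are made up for illustration. It also assumes a JDK that ships the extended charsets, which is where JISAutoDetect and windows-31j live; as the quoted description says, JISAutoDetect can only decode, not encode.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CsvLineReader {

    // Decodes the uploaded bytes with the named charset and returns the text lines.
    // (Hypothetical helper; the original code processes each line inside the loop instead.)
    static List<String> readLines(byte[] content, String charsetName) throws IOException {
        Charset charset = Charset.forName(charsetName);
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample data (an assumption): U+3012 is the postal mark, U+88F5 is the Kanji
        // from the question, encoded here with windows-31j for demonstration only.
        byte[] content = "\u3012\u88F5,1\n".getBytes(Charset.forName("windows-31j"));

        // JISAutoDetect may be missing on runtimes without the extended charsets.
        String name = Charset.isSupported("JISAutoDetect") ? "JISAutoDetect" : "windows-31j";
        for (String line : readLines(content, name)) {
            System.out.println(name + " -> " + line);
        }
    }
}
```

Because the detection is heuristic, it may be safer to pass an explicit charset name once the real encoding of the client's files is known.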

From this I can see a few consequences:

1) From the JISAutoDetect description, I can assume it is theoretically possible that the file I had wasn't actually encoded in Shift-JIS, which would explain the garbled characters when reading it as Shift-JIS. If it was, for instance, EUC-JP, then JISAutoDetect detected this and converted everything correctly.

But I obtained this file from a client running a Japanese version of Windows, whose native encoding should be Shift-JIS (at least, my client says so). I also tried converting the same characters from a UTF-8 encoded file to Shift-JIS with an online conversion tool. That file produced the same garbled characters after passing through my code.

2) So, if everything above is correct, there may be a bug in how Java processes Shift-JIS files, though that is very hard to believe.
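One way to check both consequences without assuming a JDK bug is to decode the same bytes strictly with several candidate charsets and see which ones accept them. Java registers Shift_JIS and windows-31j (Microsoft's code page 932, also known as MS932) as separate charsets, and files saved as "Shift-JIS" on Japanese Windows are often in the latter. Whether that applies to this particular file is an assumption, not something established above. The sketch below uses hypothetical names (CharsetProbe, strictDecode, probe) and made-up sample bytes; it assumes a JDK with the extended charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.LinkedHashMap;
import java.util.Map;

public class CharsetProbe {

    // Decodes strictly: returns null instead of silently substituting
    // replacement characters when a byte sequence is malformed or unmappable.
    static String strictDecode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Tries each candidate charset and records the decoded text (or null on failure).
    static Map<String, String> probe(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            results.put(name, strictDecode(bytes, name));
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample bytes (an assumption): the two characters from the question,
        // U+3012 and U+88F5, encoded the way a CP932 editor would save them.
        byte[] sample = "\u3012\u88F5".getBytes(Charset.forName("windows-31j"));

        probe(sample, "Shift_JIS", "windows-31j", "EUC-JP").forEach((name, text) ->
                System.out.println(name + ": " + (text == null ? "rejected" : text)));
    }
}
```

Running probe on the client's actual file would show whether strict Shift_JIS rejects it while another candidate accepts it, which would point to a charset mismatch and not a decoder bug.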