java.net.URLDecoder dependent on source file encoding?

Question

Asked by A.Grandt on 16 December 2013 at 12:06

I've come across a peculiar problem. My servlet receives a URL-encoded string, and from the log I can tell that this string is correct.

I tried with this string:

"test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A"

which is the following:

"test 😎 1 ♧ ♢ ♡ ♤ 《"

However when I run the test, I get the same result as I get on my server:

"test ? 1 ? ? ? ? ?"

Dumping the hex codes I get

00: 74 65 73 74 20 3F 20 31  20 3F 20 3F 20 3F 20 3F | test ? 1  ? ? ? ? 
10: 20 3F -- -- -- -- -- --  -- -- -- -- -- -- -- -- |  ?

Where I expected:

00: 74 65 73 74 20 F0 9F 98  8E 20 31 20 E2 99 A7 20 | test ... . 1 ... 
10: E2 99 A2 20 E2 99 A1 20  E2 99 A4 20 E3 80 8A -- | ... ...  ... ...

Now for the "interesting" bit. This happens on my server and in my Eclipse IDE, but if I then save the source file as UTF-8, URLDecoder returns the correct data! It didn't help on my server, though.

1: I can't see how that can even be the case; URLDecoder should honor the encoding it is asked to use. 2: If it really does this, java.net.URLDecoder is fundamentally broken and I obviously need a replacement. Any suggestions?

Test code:

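import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// HexTools comes from Mobicents; the package name is taken from the source path linked below.
import org.mobicents.commons.HexTools;

public class URLDecoderTest {
    public static void main(String[] args) {
        String reqMsg = "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A";
        System.out.println("reqMsg      : " + reqMsg);
        try {
            // Decode the percent-escapes as UTF-8 bytes
            reqMsg = URLDecoder.decode(reqMsg, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8
            e.printStackTrace();
        }
        System.out.println("reqMsg      : " + reqMsg);
        // Dumps the String itself, not its UTF-8 bytes
        System.out.println(HexTools.dump(reqMsg));
        System.out.println("Expected (fixed):");
        System.out.println("00: 74 65 73 74 20 F0 9F 98  8E 20 31 20 E2 99 A7 20 | test ... . 1 ... ");
        System.out.println("10: E2 99 A2 20 E2 99 A1 20  E2 99 A4 20 E3 80 8A -- | ... ...  ... ...");
    }
}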

Note: HexTools is from Mobicents: http://code.google.com/p/mobicents/source/browse/trunk/commons/src/main/java/org/mobicents/commons/HexTools.java?r=21908

Edit: Looking at the source for URLDecoder.decode, it uses new String(bytes, 0, pos, enc) to decode the bytes. For some reason that fails; however, for Unicode, new String(bytes, 0, pos) works fine.

Is there a bug in Java's StringCoding class that makes it automatically fall back to the "default" charset, regardless of what is passed to it? The decode method called by String is static, and it sets the requested encoding in another static method before calling decode, which then uses this static. In other words: it is not thread-safe!
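
One way to check whether the charset argument is honored is to call the same String constructor directly and inspect the resulting chars as numbers, so that console and source file encodings cannot interfere. The following is a minimal sketch; the class, constant, and method names are made up for illustration, and the three bytes are the %E2%99%A7 escape from the test string.

```java
import java.io.UnsupportedEncodingException;

class ExplicitCharsetCheck {
  // The escape %E2%99%A7 from the test string, as raw bytes (U+2667 in UTF-8).
  static final byte[] RAW = { (byte) 0xE2, (byte) 0x99, (byte) 0xA7 };

  // Same constructor shape as the call quoted from URLDecoder.decode.
  static String decodeAs(String charsetName) {
    try {
      return new String(RAW, 0, RAW.length, charsetName);
    } catch (UnsupportedEncodingException e) {
      // Rethrow unchecked to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String utf8 = decodeAs("UTF-8");
    String latin1 = decodeAs("ISO-8859-1");
    // Print lengths and code units, never the characters themselves.
    System.out.format("UTF-8      : %d char(s), first = U+%04X%n",
        utf8.length(), (int) utf8.charAt(0));
    System.out.format("ISO-8859-1 : %d char(s), first = U+%04X%n",
        latin1.length(), (int) latin1.charAt(0));
  }
}
```

If the requested charset is honored, the UTF-8 result is a single char, U+2667, while the ISO-8859-1 result is three chars starting with U+00E2. The outcome does not depend on the platform default.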

Update: I had problems in just about every layer of my implementation. The emoji character (a 4-byte UTF-8 character) caused trouble in MySQL, for instance: I got ASCII-fied characters back from it, even though it was set to utf8.

Closing remark: Part of the problem, or perceived problem really, was caused by misuse of HexTools.dump(String), a method built to handle binary data, where each char of the String holds data only in its low byte.

For future reference, the call to HexTools.dump should have been:

        System.out.println(HexTools.dump(reqMsg.getBytes("UTF-8")));

with the catch block for the UnsupportedEncodingException moved down to cover that line, of course. Doing that returns a hex frame identical to the expected one.

There are 2 answers

Joop Eggen On 16 December 2013 at 13:13

HexTools.dump must be at fault. It is passed a String, i.e. Unicode text, so how can it dump bytes other than by using the default platform encoding, probably Windows ANSI?

Try something like:

System.out.println(Arrays.toString(reqMsg.getBytes(StandardCharsets.UTF_8)));

You won't see a question mark (0x3F == 63).
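
To illustrate where the 3F bytes can come from, here is a rough sketch that encodes the decoded String twice: once as UTF-8 and once with a legacy single-byte charset. ISO-8859-1 is only a stand-in for the platform default (an assumption; the actual default on the asker's machine is not known), and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDumpSketch {
  // Encodes the text with the given charset and formats the bytes as hex.
  static String hex(String text, Charset charset) {
    StringBuilder sb = new StringBuilder();
    for (byte b : text.getBytes(charset)) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  // Decodes the test string from the question.
  static String decodeSample() {
    try {
      return URLDecoder.decode(
          "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A",
          "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // Rethrow unchecked to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String decoded = decodeSample();
    // Characters the legacy charset cannot represent are replaced on encoding.
    System.out.println("ISO-8859-1: " + hex(decoded, StandardCharsets.ISO_8859_1));
    // UTF-8 can represent every character, so nothing is replaced.
    System.out.println("UTF-8     : " + hex(decoded, StandardCharsets.UTF_8));
  }
}
```

The decoded String is the same in both cases; only the charset used to turn it back into bytes differs.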

McDowell On 16 December 2013 at 13:27 (Accepted Answer)

This code works as expected:

import java.io.IOException;
import java.net.URLDecoder;

public class Dump {
  public static void main(String[] args) throws IOException {
    String reqMsg = 
         "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A";
    String decoded = URLDecoder.decode(reqMsg, "UTF-8");
    // UTF-16
    for (char ch : decoded.toCharArray()) {
      System.out.format("%04x ", (int) ch);
    }
    System.out.println();
    // UTF-8
    for (byte ch : decoded.getBytes("UTF-8")) {
      System.out.format("%02x ", 0xFF & ch);
    }
  }
}

However, you can lose information here:

System.out.println

The above PrintStream will perform a (potentially lossy) transcoding operation. From the documentation:

All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.

On many systems, Java's default encoding is an obsolete legacy encoding that cannot represent these characters.
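
The transcoding step can be observed in isolation by giving a PrintStream an explicit encoding and capturing what it writes. This is a sketch under assumptions: US-ASCII stands in for a lossy default encoding, and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

class PrintStreamEncodingSketch {
  // Prints the text through a PrintStream that uses the named encoding
  // and returns the bytes that actually reached the underlying stream.
  static byte[] render(String text, String charsetName) {
    try {
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      PrintStream out = new PrintStream(sink, true, charsetName);
      out.print(text);
      out.flush();
      return sink.toByteArray();
    } catch (UnsupportedEncodingException e) {
      // Rethrow unchecked to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    String club = "\u2667"; // the character decoded from %E2%99%A7
    System.out.println("US-ASCII: " + Arrays.toString(render(club, "US-ASCII")));
    System.out.println("UTF-8   : " + Arrays.toString(render(club, "UTF-8")));

    // A console stream with an explicit encoding instead of the default one.
    PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    utf8Out.println(club);
  }
}
```

Wrapping System.out this way only helps if the terminal or log viewer that receives the bytes also interprets them as UTF-8.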

It may also be the case that your servlet container is misconfigured. Not sure if it is true of the latest versions, but Tomcat has historically defaulted to ISO-8859-1 for URL encoding.
