java.net.URLDecoder dependent on source file encoding?

I've come across a peculiar problem. My servlet receives a URL-encoded string, and from the log I can tell that this string is correct.

I tried with this string:

"test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A"

which is the following:

"test  1 ♧ ♢ ♡ ♤ 《"

However, when I run the test, I get the same result as on my server:

"test ? 1 ? ? ? ? ?"

Dumping the hex codes, I get:

00: 74 65 73 74 20 3F 20 31  20 3F 20 3F 20 3F 20 3F | test ? 1  ? ? ? ? 
10: 20 3F -- -- -- -- -- --  -- -- -- -- -- -- -- -- |  ?                

Where I expected:

00: 74 65 73 74 20 F0 9F 98  8E 20 31 20 E2 99 A7 20 | test ... . 1 ... 
10: E2 99 A2 20 E2 99 A1 20  E2 99 A4 20 E3 80 8A -- | ... ...  ... ...

Now for the "interesting" bit. This happens on my server and in my Eclipse IDE, but if I save the source file as UTF-8, URLDecoder returns the correct data! It didn't help on my server, though.

1: I can't see how that can even be the case; URLDecoder should honor the encoding it is asked for. 2: I obviously need a replacement for java.net.URLDecoder, because if it really behaves this way, it is fundamentally broken. Any suggestions?
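As a first sanity check (not part of the original post), it can help to print the JVM's default charset, since that is what everything falls back to when the requested encoding is not honoured:

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // The default charset is what PrintStream, String.getBytes() without an
        // argument, and new String(byte[]) use when no encoding is specified.
        System.out.println("file.encoding   : " + System.getProperty("file.encoding"));
        System.out.println("default charset : " + Charset.defaultCharset());
    }
}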

Test code:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

import org.mobicents.commons.HexTools;

public class URLDecoderTest {
    public static void main(String[] args) {
        String reqMsg = "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A";
        System.out.println("reqMsg      : " + reqMsg);
        try {
            reqMsg = URLDecoder.decode(reqMsg, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported, so this should never happen
            e.printStackTrace();
        }
        System.out.println("reqMsg      : " + reqMsg);
        System.out.println(HexTools.dump(reqMsg));
        System.out.println("Expected (fixed):");
        System.out.println("00: 74 65 73 74 20 F0 9F 98  8E 20 31 20 E2 99 A7 20 | test ... . 1 ... ");
        System.out.println("10: E2 99 A2 20 E2 99 A1 20  E2 99 A4 20 E3 80 8A -- | ... ...  ... ...");
    }
}

Note: HexTools is from Mobicents: http://code.google.com/p/mobicents/source/browse/trunk/commons/src/main/java/org/mobicents/commons/HexTools.java?r=21908

Edit: Looking at the source of URLDecoder.decode, it uses new String(bytes, 0, pos, enc) to decode the bytes. For some reason that call fails here, yet new String(bytes, 0, pos), which falls back to the default encoding, handles the Unicode characters fine.

Is there a bug in Java's StringCoding class that makes it fall back to the "default" charset regardless of what is passed to it? The decode method called by String is static, and it stores the requested encoding via another static method before calling decode, which then uses that static state. In other words: it is not thread-safe!
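One way to take URLDecoder (and any static state it may have) out of the equation entirely is to undo the percent-escapes by hand and convert the resulting bytes with an explicit charset. A minimal sketch, assuming the input contains only unescaped ASCII, '+' and %XX escapes:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ManualUrlDecode {

    // Decode '+' and %XX escapes into raw bytes, then build the String with an
    // explicit charset, so neither static state nor the default charset is involved.
    static String decode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                bytes.write(' ');
            } else if (c == '%' && i + 2 < s.length()) {
                bytes.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decode(
                "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A");
        for (byte b : decoded.getBytes(StandardCharsets.UTF_8)) {
            System.out.format("%02x ", 0xFF & b); // should match the expected dump above
        }
        System.out.println();
    }
}

If this prints the expected byte sequence while URLDecoder.decode does not, the decoder really is at fault; if both print the same thing, the problem is on the dumping or output side.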

Update: I had problems in just about every layer of my implementation. The emoji character (a 4-byte UTF-8 sequence) caused trouble in MySQL, for instance: I got ASCII-fied characters back, even though the database was set to utf8.
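For the MySQL side (a sketch, not from the original setup): MySQL's legacy utf8 charset stores at most 3 bytes per character, so 4-byte emoji need utf8mb4 on the column and the connection. Roughly, with Connector/J (table name, credentials and the exact driver settings are hypothetical and version-dependent):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EmojiInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical database, table and credentials.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/testdb?characterEncoding=UTF-8", "user", "pass");
        // Ask for a utf8mb4 session so 4-byte characters survive; the target
        // column must be declared with CHARACTER SET utf8mb4 as well.
        con.createStatement().execute("SET NAMES utf8mb4");
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO messages (body) VALUES (?)");
        ps.setString(1, "test \uD83D\uDE0E 1"); // \uD83D\uDE0E is the emoji from the test string
        ps.executeUpdate();
        ps.close();
        con.close();
    }
}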

Closing remark: Part of the problem, or perceived problem really, was caused by misuse of HexTools.dump(String), a method built to handle binary data, where each char of the String is expected to carry data only in its low byte.

For future reference, the call to HexTools.dump should have been:

        System.out.println(HexTools.dump(reqMsg.getBytes("UTF-8")));

with the catch block for the UnsupportedEncodingException moved down to cover that line as well, of course. Doing that produces a hex dump identical to the expected one.
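Putting it together, the tail of the test's main method would then look roughly like this (same HexTools class as before):

try {
    reqMsg = URLDecoder.decode(reqMsg, "UTF-8");
    System.out.println("reqMsg      : " + reqMsg);
    // Hand HexTools real bytes (the UTF-8 encoding of the decoded string),
    // not the String itself.
    System.out.println(HexTools.dump(reqMsg.getBytes("UTF-8")));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}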

There are 2 answers

McDowell (accepted answer):

This code works as expected:

import java.io.IOException;
import java.net.URLDecoder;

public class Dump {
  public static void main(String[] args) throws IOException {
    String reqMsg = 
         "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A";
    String decoded = URLDecoder.decode(reqMsg, "UTF-8");
    // UTF-16
    for (char ch : decoded.toCharArray()) {
      System.out.format("%04x ", (int) ch);
    }
    System.out.println();
    // UTF-8
    for (byte ch : decoded.getBytes("UTF-8")) {
      System.out.format("%02x ", 0xFF & ch);
    }
  }
}

However, you can lose information here:

System.out.println

The above PrintStream will perform a (potentially lossy) transcoding operation. From the documentation:

All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.

On many systems, Java uses an obsolete legacy encoding.
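If you want System.out itself to emit UTF-8 regardless of the platform default, one option (a sketch, not part of the original answer) is to replace it with a PrintStream built with an explicit encoding; the console must still be able to display UTF-8, of course:

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap the real stdout in a PrintStream that encodes as UTF-8
        // instead of the platform default.
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
        System.out.println("test \uD83D\uDE0E 1 \u2667 \u2662 \u2661 \u2664 \u300A");
    }
}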

It may also be the case that your servlet container is misconfigured. Not sure if it is true of the latest versions, but Tomcat has historically defaulted to ISO-8859-1 for URL encoding.
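For Tomcat specifically, the relevant setting is the URIEncoding attribute on the HTTP connector in server.xml; a sketch (port, protocol and other attributes are whatever your installation already uses):

<Connector port="8080" protocol="HTTP/1.1"
           URIEncoding="UTF-8" />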

Joop Eggen:

HexTools.dump must be the culprit. It is passed a String, i.e. Unicode text, so how can it dump bytes other than by using the default platform encoding, probably Windows ANSI?

Try something like:

System.out.println(Arrays.toString(reqMsg.getBytes(StandardCharsets.UTF_8)));

You won't see a question mark (0x3F == 63).
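For completeness, a self-contained version of that check (assuming Java 7+ for StandardCharsets):

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteDump {
    public static void main(String[] args) throws Exception {
        String reqMsg = URLDecoder.decode(
                "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A", "UTF-8");
        // getBytes with an explicit charset: no platform default involved.
        System.out.println(Arrays.toString(reqMsg.getBytes(StandardCharsets.UTF_8)));
    }
}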