I've come across a peculiar problem. My servlet receives a URL-encoded string, and from the log I can tell that this string is correct.
I tried with this string:
"test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A"
which is the following:
"test 😎 1 ♧ ♢ ♡ ♤ 《"
However when I run the test, I get the same result as I get on my server:
"test ? 1 ? ? ? ? ?"
Dumping the hex codes I get
00: 74 65 73 74 20 3F 20 31 20 3F 20 3F 20 3F 20 3F | test ? 1 ? ? ? ?
10: 20 3F -- -- -- -- -- -- -- -- -- -- -- -- -- -- | ?
Where I expected:
00: 74 65 73 74 20 F0 9F 98 8E 20 31 20 E2 99 A7 20 | test ... . 1 ...
10: E2 99 A2 20 E2 99 A1 20 E2 99 A4 20 E3 80 8A -- | ... ... ... ...
Now for the "interesting" bit. This happens on my server, and on my Eclipse IDE, but if I then save the source file in UTF-8, the URLDecoder returns the correct data! It didn't help on my server though.
1: I can't see how that can even be the case; URLDecoder should honor the requested encoding. 2: If it really does this, java.net.URLDecoder is fundamentally broken and I obviously need a replacement. Any suggestions?
Test code:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

import org.mobicents.commons.HexTools;

public class URLDecoderTest {
    public static void main(String[] args) {
        String reqMsg = "test+%F0%9F%98%8E+1+%E2%99%A7+%E2%99%A2+%E2%99%A1+%E2%99%A4+%E3%80%8A";
        System.out.println("reqMsg : " + reqMsg);
        try {
            // Decode the percent-encoded bytes as UTF-8
            reqMsg = URLDecoder.decode(reqMsg, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported, so this should never happen
            e.printStackTrace();
        }
        System.out.println("reqMsg : " + reqMsg);
        System.out.println(HexTools.dump(reqMsg));
        System.out.println("Expected (fixed):");
        System.out.println("00: 74 65 73 74 20 F0 9F 98 8E 20 31 20 E2 99 A7 20 | test ... . 1 ... ");
        System.out.println("10: E2 99 A2 20 E2 99 A1 20 E2 99 A4 20 E3 80 8A -- | ... ... ... ...");
    }
}
Note: HexTools is from Mobicents: http://code.google.com/p/mobicents/source/browse/trunk/commons/src/main/java/org/mobicents/commons/HexTools.java?r=21908
Edit: Looking at the source for URLDecoder.decode, it uses new String(bytes, 0, pos, enc) to decode the bytes. For some reason that fails; for Unicode, however, new String(bytes, 0, pos) works fine.
Is there a bug in Java's StringCoding class that makes it automatically fall back to the "default" charset, regardless of what is passed to it? The decode method called by String is static, and it sets the requested encoding in another static method before calling decode, which then uses that static state. In other words: it is not thread-safe!
Update: I had problems in just about all layers of my implementation. The emoji character (a 4-byte UTF-8 character) caused trouble in MySQL, for instance: I got ASCII-fied characters back from it, even though it was set to utf8.
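The MySQL symptom is consistent with MySQL's utf8 charset storing at most three bytes per character, so code points outside the Basic Multilingual Plane need utf8mb4 instead. A minimal sketch for spotting such characters before they reach the database; the class and method names are illustrative and not part of any library:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class SupplementaryCheck {
    // Returns true if the text contains any code point above U+FFFF,
    // i.e. one that takes four bytes when encoded as UTF-8.
    static boolean needsFourByteUtf8(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return false;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String decoded = URLDecoder.decode("test+%F0%9F%98%8E+1+%E2%99%A7", "UTF-8");
        System.out.println(needsFourByteUtf8(decoded));         // true: contains U+1F60E
        System.out.println(needsFourByteUtf8("test 1 \u2667")); // false: U+2667 fits in three bytes
    }
}
```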
Closing remark: Part of the problem, or perceived problem really, was caused by misuse of HexTools.dump(String). HexTools is a class built to handle binary data, and it expects each of the String's chars to contain data only in the low byte.
For future reference, the call to HexTools.dump should have been:
System.out.println(HexTools.dump(reqMsg.getBytes("UTF-8")));
with the catch block for the UnsupportedEncodingException moved down to cover that line, of course. Doing that returns a hex dump identical to the expected one.
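For readers without HexTools on the classpath, the same check can be sketched with the standard library only. This is a minimal illustration, not the HexTools format; the class and method names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the URL-encoded input as UTF-8, then dumps the UTF-8 bytes
    // of the resulting String (not its raw chars).
    static String decodedUtf8Hex(String urlEncoded) {
        try {
            String decoded = URLDecoder.decode(urlEncoded, "UTF-8");
            return toHex(decoded.getBytes(StandardCharsets.UTF_8));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // prints: 74 65 73 74 20 F0 9F 98 8E 20 31 20 E2 99 A7
        System.out.println(decodedUtf8Hex("test+%F0%9F%98%8E+1+%E2%99%A7"));
    }
}
```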
This code works as expected:
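A minimal sketch of such a check, assuming part of the input string from the question: it decodes the string and lists its code points in plain ASCII, so the console encoding cannot distort the result. The class and method names are illustrative:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeCodePoints {
    // Decodes the URL-encoded input as UTF-8 and renders every code point
    // of the result as U+XXXX, separated by spaces.
    static String codePoints(String urlEncoded) {
        String decoded;
        try {
            decoded = URLDecoder.decode(urlEncoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // a surrogate pair counts as one code point
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: U+1F60E U+0020 U+2667 U+0020 U+300A
        System.out.println(codePoints("%F0%9F%98%8E+%E2%99%A7+%E3%80%8A"));
    }
}
```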
However, you can lose information here:
The above PrintStream will perform a (potentially lossy) transcoding operation. From the documentation:
On many systems, Java uses an obsolete legacy encoding.
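The lossy step can be reproduced without a console by writing through a PrintStream into a byte buffer. A minimal sketch, assuming US-ASCII stands in for a legacy default encoding; the class and method names are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamTranscoding {
    // Prints the text through a PrintStream that uses the named charset
    // and returns the bytes that actually reached the underlying stream.
    static String printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "1 \u2667"; // "1 ♧", escaped so the source file encoding does not matter
        System.out.println(printedBytes(text, "US-ASCII")); // prints: 31 20 3F
        System.out.println(printedBytes(text, "UTF-8"));    // prints: 31 20 E2 99 A7
    }
}
```

The String itself is intact in both cases; only the bytes written by the stream differ, with the unmappable character replaced by 3F ("?") in the first.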
It may also be the case that your servlet container is misconfigured. I'm not sure whether this is still true of the latest versions, but Tomcat has historically defaulted to ISO-8859-1 for URL encoding.
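If the container has already decoded the parameter as ISO-8859-1, one commonly used workaround is to turn the string back into bytes and decode those as UTF-8. The sketch below simulates the container with URLDecoder; the class and method names are hypothetical. Fixing the container configuration (in Tomcat, the URIEncoding attribute on the connector) is usually preferable to repairing strings after the fact:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class Latin1Repair {
    // Thin wrapper around URLDecoder that hides the checked exception.
    static String decode(String urlEncoded, String charsetName) {
        try {
            return URLDecoder.decode(urlEncoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Re-interprets a string that was wrongly decoded as ISO-8859-1.
    // ISO-8859-1 maps every byte to exactly one char, so the original
    // bytes can be recovered and then decoded as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "test+%F0%9F%98%8E+1+%E2%99%A7";
        String wrong = decode(raw, "ISO-8859-1"); // what a misconfigured container would hand over
        String fixed = repair(wrong);
        System.out.println(fixed.equals(decode(raw, "UTF-8"))); // prints: true
    }
}
```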