Punycode for Unicode query parameter

I am trying to encode some Unicode URLs with Punycode. These URLs have a query parameter that contains non-ASCII characters, for example:

https://en.wiktionary.org/w/index.php?title=Clœlia&printable=yes

The problem is, when I try to do it in Java, the resulting URL is wrong:

String link = "https://en.wiktionary.org/w/index.php?title=Clœlia&printable=yes";
link = IDN.toASCII(link);

// -> link = http://en.wiktionary.org/w/index.xn--php?title=cllia&printable=yes-hgf

If I do it this way, the resulting string is different (I don't know why), but is also wrong:

String link = "http://en.wiktionary.org/w/index.php?title=" + IDN.toASCII("Clœlia") + "&printable=yes";

// -> link = http://en.wiktionary.org/w/index.php?title=xn--cllia-ibb&printable=yes

If I copy the address from Chrome and paste it here, I get this URL, which is what I want:

https://en.wiktionary.org/w/index.php?title=Cl%C5%93lia&printable=yes

What did I do wrong here?

Answer by dave_thompson_085 (best answer)

What you did wrong is use Punycode. Punycode is used only for domain names, including the domain-name part of a URL.
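To illustrate where Punycode does belong, here is a minimal sketch that applies java.net.IDN.toASCII to a host name alone. The host "bücher.example" and the class and method names are illustrative assumptions, not taken from the question.

```java
import java.net.IDN;

class HostOnlyPunycode {
    // Converts only a host name to its ASCII (Punycode) form.
    // The path and query string of a URL must NOT be passed through this.
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // "b\u00FCcher.example" is a hypothetical internationalized host name.
        System.out.println(toAsciiHost("b\u00FCcher.example")); // xn--bcher-kva.example
    }
}
```

Each dot-separated label is converted independently, which is why passing a whole URL to IDN.toASCII mangles the path and query as shown in the question.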

Other parts of a URL, including the query string, use percent-encoding, also known as URL encoding or URI encoding, and that is what Chrome is doing. Percent-encoding first converts non-ASCII Unicode characters to UTF-8, then writes every octet outside a limited subset of ASCII as a percent sign (%) followed by two hex digits. The octets 80-FF that UTF-8 uses for non-ASCII characters are therefore always percent-encoded. To be exact, the query string usually (and other parts sometimes) uses a slight variant defined for HTML form submission, application/x-www-form-urlencoded. This variant encodes a space as a plus sign ('+') instead of %20, which is unambiguous because a literal '+' is already in the unsafe set and is itself encoded as %2B.
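The mechanism described above can be sketched by hand. This is only an illustration of the idea, not how any particular browser or library implements it. It assumes the "limited subset of ASCII" is the RFC 3986 unreserved set (letters, digits, '-', '.', '_', '~'), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodeSketch {
    // Percent-encodes a string: UTF-8 first, then %XX for every octet
    // outside the RFC 3986 unreserved set (an assumption for this sketch).
    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF; // treat the byte as unsigned
            boolean unreserved =
                    (octet >= 'A' && octet <= 'Z')
                 || (octet >= 'a' && octet <= 'z')
                 || (octet >= '0' && octet <= '9')
                 || octet == '-' || octet == '.' || octet == '_' || octet == '~';
            if (unreserved) {
                out.append((char) octet);
            } else {
                // Octets 0x80-0xFF always land here, so non-ASCII is always encoded.
                out.append(String.format("%%%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0153 (œ) is the two UTF-8 octets C5 93.
        System.out.println(percentEncode("Cl\u0153lia")); // Cl%C5%93lia
    }
}
```

Note that this plain form writes a space as %20; the form-submission variant would write '+' instead.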

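In Java, use java.net.URLEncoder.encode and java.net.URLDecoder.decode for this. For reliable results, use the newer two-argument forms with the encoding name "UTF-8". Apply them to each parameter name or value separately, not to the whole URL, because characters such as ':', '/', '?' and '&' would otherwise be encoded as well.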
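Applied to the URL from the question, that could look like the following sketch. The helper names encodeQueryValue and buildLink are illustrative assumptions; only URLEncoder.encode itself is the library call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncodeDemo {
    // Form-encodes a single query value as UTF-8 (space becomes '+').
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Builds the link from the question, encoding only the title value.
    static String buildLink(String title) {
        return "https://en.wiktionary.org/w/index.php?title="
                + encodeQueryValue(title) + "&printable=yes";
    }

    public static void main(String[] args) {
        System.out.println(buildLink("Cl\u0153lia"));
        // https://en.wiktionary.org/w/index.php?title=Cl%C5%93lia&printable=yes
    }
}
```

This produces the same URL that Chrome shows. On Java 10 and later there is also an overload that takes a Charset (for example StandardCharsets.UTF_8), which avoids the checked exception.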