I've been looking into internationalised resource identifiers and there's one thing bugging me.
My understanding is that, for each label in a domain name (`xyzzy.plugh.com` has three labels, `xyzzy`, `plugh` and `com`), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:
- If it consists solely of ASCII characters, it's copied as is.
- Otherwise:
  - First we output `xn--` followed by all the ASCII characters (skipping non-ASCII).
  - Then, if the final character isn't `-`, we output `-` to separate the ASCII from non-ASCII.
  - Finally, we encode each of the non-ASCII characters using punycode so that they appear to be ASCII.
My question then is: how do we distinguish between the following two Unicode URIs?
http://aa☃.net/
http://☃aa.net/
It seems to me that both of these will encode to:
http://xn--aa-nfh.net/
simply because the sequencing information has been lost for the label as a whole.
Or am I missing something in the specification?
According to one punycode encoder, they are encoded differently.
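One way to check this yourself is with Python's built-in `punycode` codec. This is only a rough sketch: the helper names `to_ascii_label` and `to_ascii_host` are made up for illustration, and the code skips the normalisation and validation steps that real IDNA processing performs before punycode is applied.

```python
def to_ascii_label(label: str) -> str:
    """Hypothetical helper: ASCII form of a single label (no IDNA normalisation)."""
    if label.isascii():
        # Pure-ASCII labels are copied as is.
        return label
    # The codec emits the basic characters, a '-' separator, then the encoded deltas.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_host(host: str) -> str:
    """Hypothetical helper: apply to_ascii_label to each dot-separated label."""
    return ".".join(to_ascii_label(label) for label in host.split("."))


print(to_ascii_host("aa☃.net"))  # xn--aa-gsx.net
print(to_ascii_host("☃aa.net"))  # xn--aa-esx.net
```

The two hosts share the basic part `aa` but end in different suffixes, `gsx` and `esx`.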
The relevant RFC 3492 details why this is the case. First, it provides clues in the introduction:
That means there must be a distinguishable one-to-one mapping for every single basic/extended string pair.
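A quick way to convince yourself of the one-to-one property is a round trip through Python's built-in `punycode` codec. This is just an illustrative check on three labels I picked (the third, `a☃a`, isn't from the question), not a proof:

```python
# Three labels with the same characters in different orders.
labels = ["aa☃", "☃aa", "a☃a"]

# Encode each label; the codec returns bytes.
encoded = [label.encode("punycode") for label in labels]

# Every label gets its own encoding...
assert len(set(encoded)) == len(labels)

# ...and decoding recovers exactly the original label, order included.
decoded = [item.decode("punycode") for item in encoded]
assert decoded == labels
```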
Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, with all its Unicode glory) works.
The decoder starts with just the basic string `aa.net`, with a pointer to the first `a`, then applies a series of deltas, such as `gsx` or `esx`. The delta actually encodes two things. The first is the number of non-insertions to be done and the second is the actual insertion.
So, `gsx` (the delta in `aa☃.net`) would encode two non-insertions (to skip the `aa`) followed by an insertion of `☃`. The `esx` delta (for `☃aa.net`) would encode zero non-insertions followed by an insertion of `☃`.
That is how position is encoded into the basic strings.
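To make the "two things in one delta" idea concrete, here is a minimal sketch of the decoding step, using the constants from RFC 3492. It's deliberately simplified: it only handles the first delta of a label (so the RFC's bias adaptation, which matters for later deltas, is left out), and the function names `decode_first_delta` and `apply_first_delta` are my own, not anything from the RFC.

```python
BASE, TMIN, TMAX = 36, 1, 26        # Punycode parameters from RFC 3492
INITIAL_BIAS, INITIAL_N = 72, 128   # starting bias and first non-ASCII code point


def digit_value(ch: str) -> int:
    """Map a Punycode digit to its value: a-z are 0..25, 0-9 are 26..35."""
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def decode_first_delta(digits: str) -> int:
    """Decode one variable-length integer using the initial bias only."""
    value, weight, k = 0, 1, BASE
    for ch in digits:
        d = digit_value(ch)
        value += d * weight
        threshold = min(max(k - INITIAL_BIAS, TMIN), TMAX)
        if d < threshold:           # a digit below the threshold ends the integer
            return value
        weight *= BASE - threshold
        k += BASE
    raise ValueError("truncated delta")


def apply_first_delta(basic: str, digits: str) -> str:
    """Assumption: exactly one non-ASCII character is being inserted."""
    delta = decode_first_delta(digits)
    slots = len(basic) + 1                  # possible insertion positions
    code_point = INITIAL_N + delta // slots  # which character to insert
    position = delta % slots                 # how many characters to skip first
    return basic[:position] + chr(code_point) + basic[position:]


print(apply_first_delta("aa", "gsx"))  # aa☃
print(apply_first_delta("aa", "esx"))  # ☃aa
```

With the basic string `aa` there are three insertion slots, so `gsx` (28811) splits into code point 128 + 9603 = U+2603 at position 2, while `esx` (28809) gives the same code point at position 0.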