How does punycode distinguish similar IRIs?

I've been looking into internationalised resource identifiers and there's one thing bugging me.

My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels, xyzzy, plugh and com), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:

  • If it consists solely of ASCII characters, it's copied as is.
  • Otherwise:
    • First we output xn-- followed by all the ASCII characters (skipping non-ASCII).
    • Then, if the final character isn't -, we output - to separate the ASCII from non-ASCII.
    • Finally, we encode each of the non-ASCII characters using punycode so that they appear to be ASCII.

My question then is: how do we distinguish between the following two Unicode URIs?

http://aa☃.net/
http://☃aa.net/

It seems to me that both of these will encode to:

http://xn--aa-nfh.net/

simply because the sequencing information has been lost for the label as a whole.

Or am I missing something in the specification?

There are 1 answers

brunesto On BEST ANSWER

According to one punycode encoder, they are encoded differently:

aa☃.net -> xn--aa-gsx.net
☃aa.net -> xn--aa-esx.net
                  ^
                  see here
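The same result can be reproduced with Python's built-in punycode codec. This is only a quick check, not a full IDNA implementation: the codec works on one label at a time and does not add the xn-- prefix, so the prefix is prepended by hand here, and the names SNOWMAN and encoded are just illustrative.

```python
# Encode each label with Python's built-in "punycode" codec (RFC 3492).
# The codec handles a single label and does not add the "xn--" ACE prefix,
# so it is prepended manually.
SNOWMAN = "\u2603"  # the snowman character, U+2603

labels = ["aa" + SNOWMAN, SNOWMAN + "aa"]
encoded = {
    label: "xn--" + label.encode("punycode").decode("ascii")
    for label in labels
}

for label, ace in encoded.items():
    print(f"{label}.net -> {ace}.net")
```

The ASCII part (aa-) is identical in both outputs; only the trailing delta differs.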

RFC 3492, the relevant specification, details why this is the case. First, it provides clues in the introduction:

Uniqueness: There is at most one basic string that represents a given extended string.

Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.

Together, these mean the mapping must be one-to-one: two different extended strings can never map to the same basic string.
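As a rough illustration of reversibility, the two basic strings can be decoded again with Python's built-in punycode codec (again one label at a time, without the xn-- prefix). The variable names are illustrative only.

```python
# Decode the two basic strings and confirm each one recovers a different
# extended string, and that re-encoding gives back the same basic string.
SNOWMAN = "\u2603"  # the snowman character, U+2603

decoded = {basic: basic.decode("punycode") for basic in (b"aa-gsx", b"aa-esx")}

for basic, extended in decoded.items():
    round_trip = extended.encode("punycode")
    print(basic, "->", extended, "->", round_trip)
```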

Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, in all its Unicode glory) works.

The decoder starts with just the basic code points of the label, aa (so aa.net), with a pointer to the first a, then applies a series of deltas. Here there is only one delta: gsx or esx.

The delta actually encodes two things: how many non-insertions (positions to skip) to perform, and which code point to insert once they are done.

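So, gsx (the delta in aa☃.net) decodes to the integer 28811. With three possible insertion positions around aa, that is 9603 × 3 + 2: the 9603 advances the code point from the initial 128 to 9731 (U+2603), and the remainder encodes two non-insertions (to skip the aa) followed by an insertion of ☃. The esx delta (for ☃aa.net) decodes to 28809 = 9603 × 3 + 0, which encodes zero non-insertions followed by an insertion of ☃.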
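To make the position encoding concrete, here is a minimal sketch of decoding a single delta by hand. It is a simplification of the RFC 3492 decoder, not a complete one: it handles only the first delta of a label (so the bias stays at its initial value and the adaptation step is skipped), and decode_first_delta and digit_value are hypothetical helper names.

```python
# Punycode parameters from RFC 3492, section 5.
BASE, TMIN, TMAX = 36, 1, 26
INITIAL_BIAS, INITIAL_N = 72, 128


def digit_value(ch):
    """Map a punycode digit to its value: a-z -> 0..25, 0-9 -> 26..35."""
    if ch.isdigit():
        return ord(ch) - ord("0") + 26
    return ord(ch.lower()) - ord("a")


def decode_first_delta(digits, basic_len):
    """Decode the FIRST delta of a label (simplified: no bias adaptation).

    Returns (code_point, position), where position is the index at which
    the code point is inserted into the basic code points.
    """
    i, w, k = 0, 1, BASE
    for ch in digits:
        d = digit_value(ch)
        i += d * w
        # Threshold for this digit position, clamped to [TMIN, TMAX].
        t = min(max(k - INITIAL_BIAS, TMIN), TMAX)
        if d < t:  # a digit below the threshold ends the integer
            break
        w *= BASE - t
        k += BASE
    # The quotient advances the code point; the remainder is the position.
    code_offset, position = divmod(i, basic_len + 1)
    return INITIAL_N + code_offset, position


for delta in ("gsx", "esx"):
    code_point, position = decode_first_delta(delta, basic_len=2)
    print(f"{delta}: insert U+{code_point:04X} at position {position}")
```

Both deltas yield the same code point, U+2603; only the insertion position (2 versus 0) differs, which is exactly the one-letter difference between gsx and esx.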

That is how position is encoded into the basic strings.