What does 'case insensitive' mean in RFC 3986 with respect to non-English characters?

RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.

Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points ahead of them, i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permissible by the RFC:

  1. Lower case to www.olé.com then percent encode to www.ol%E9.com
  2. Lower case only A-Z characters to www.olÉ.com and then percent encode to www.ol%C9.com
  3. Percent encode to www.OL%C9.com, and then lower case the non-percent encoded parts to www.ol%C9.com, producing the same result as 2.
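
To make the three candidates concrete, here is a minimal Python sketch. It assumes single-byte (Latin-1) percent-encoding, purely because that is what yields the %E9 and %C9 values used above; the helper names ascii_lower and lower_outside_escapes are made up for illustration and do not come from the RFC.

```python
import re
from urllib.parse import quote


def ascii_lower(text: str) -> str:
    """Lower-case only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def lower_outside_escapes(text: str) -> str:
    """Lower-case A-Z except inside %XX percent-encoded triplets."""
    return re.sub(
        r"%[0-9A-Fa-f]{2}|[A-Z]+",
        lambda m: m.group() if m.group().startswith("%") else m.group().lower(),
        text,
    )


host = "www.OLÉ.com"

# Option 1: full Unicode lower-casing, then percent-encode.
option1 = quote(host.lower(), encoding="latin-1")
# Option 2: lower-case A-Z only, then percent-encode.
option2 = quote(ascii_lower(host), encoding="latin-1")
# Option 3: percent-encode first, then lower-case the unencoded parts.
option3 = lower_outside_escapes(quote(host, encoding="latin-1"))

print(option1)  # www.ol%E9.com
print(option2)  # www.ol%C9.com
print(option3)  # www.ol%C9.com
```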

So the question is: Which is correct? If it's case 1., what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?

Answer by Tim Groeneveld:

DNS compares the ASCII letters in hostnames case-insensitively, so in practice a hostname resolved by DNS can be treated as lowercase.

It is not possible to have non-ASCII characters in DNS hostnames (RFC 1123). However, a workaround has been put in place with "internationalized domain names" (IDNs). The ASCII encoding used by this workaround is commonly known as Punycode.

Punycode enables non-ASCII characters to be represented by ASCII characters.

non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).

-- https://www.ietf.org/rfc/rfc3492.txt
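
As a small illustration of that sentence, the sketch below uses Python's built-in "punycode" codec on a single label. The label "olé" is the one from the question; the ALLOWED set is just a hypothetical name for "letters, digits, and hyphens".

```python
import string

# Characters permitted in a host name label: letters, digits, and hyphens.
ALLOWED = set(string.ascii_letters + string.digits + "-")

label = "olé"

# Raw Punycode for one label (no "xn--" prefix yet).
encoded = label.encode("punycode")
print(encoded)  # b'ol-cja'

# The encoded form contains only letters, digits, and hyphens.
only_ldh = all(ch in ALLOWED for ch in encoded.decode("ascii"))
print(only_ldh)  # True

# The transformation is reversible.
decoded = encoded.decode("punycode")
print(decoded)  # olé
```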

As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com.

If you are getting percent signs in your domain name, it means that you have URL-encoded the hostname, which is not correct, at least not for resolving.

For example, an a tag that looks like this will work correctly:

<a href="//www.ol%C3%A9.com">Click Here</a>

However, DNS will not be asked to resolve www.ol%C3%A9.com, but rather the domain name converted to Punycode:

Example

www.ol%C3%A9.com

becomes

www.olé.com

which in punycode translates to:

www.xn--ol-cja.com
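
The same two steps can be reproduced with Python's standard library, as a rough sketch. The function name host_for_dns is hypothetical. Note that Python's built-in "idna" codec implements the older IDNA 2003 rules (RFC 3490), so this is an approximation of what a browser does, not a description of any particular browser.

```python
from urllib.parse import unquote


def host_for_dns(percent_encoded_host: str) -> str:
    """Percent-decode a host (as UTF-8), then convert it to its ASCII form."""
    # Step 1: www.ol%C3%A9.com -> www.olé.com
    unicode_host = unquote(percent_encoded_host)
    # Step 2: each non-ASCII label becomes "xn--" + Punycode.
    return unicode_host.encode("idna").decode("ascii")


print(unquote("www.ol%C3%A9.com"))       # www.olé.com
print(host_for_dns("www.ol%C3%A9.com"))  # www.xn--ol-cja.com
```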

Web browsers will generally convert uppercase characters to their lowercase equivalents. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com is lowercased to www.olé.com first.
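
This behaviour can be checked with a short Python sketch. It relies on the built-in "idna" codec, which follows IDNA 2003 and folds case while preparing each label; a stricter IDNA 2008 implementation may reject the uppercase form instead of mapping it.

```python
# Both spellings are converted with Python's built-in (IDNA 2003) codec.
from_upper = "www.olÉ.com".encode("idna")
from_lower = "www.olé.com".encode("idna")

print(from_upper)  # b'www.xn--ol-cja.com'
print(from_lower)  # b'www.xn--ol-cja.com'

# The uppercase É was folded to é before Punycode encoding.
same_host = from_upper == from_lower
print(same_host)  # True
```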

I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:

Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.

The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFCs that are worth a look:

  • Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale
    https://www.rfc-editor.org/rfc/rfc5894
  • The Unicode Code Points and Internationalized Domain Names for Applications
    https://www.rfc-editor.org/rfc/rfc5892

RFC 5894, section 3.1.3, specifies that a character may be disallowed if:

  • The character is an uppercase form or some other form that is mapped to another character by Unicode case folding.
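
A rough way to test that one criterion in Python is to see whether Unicode case folding changes the character. This is only a sketch of the condition quoted above, not the full derivation algorithm from RFC 5892, and the function name changed_by_case_folding is made up for illustration.

```python
def changed_by_case_folding(ch: str) -> bool:
    """True if Unicode case folding maps ch to something else."""
    return ch.casefold() != ch


# É folds to é, and ß folds to "ss"; é, o and 9 are left unchanged.
for ch in ["É", "é", "ß", "o", "9"]:
    print(repr(ch), changed_by_case_folding(ch))
```

Under this check, É would fall into the disallowed group while é would not, which is consistent with www.olé.com being the form that gets encoded.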