RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.
Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points ahead of them in the UTF-8 character set, i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permissible by the RFC:
1. Lower case to www.olé.com then percent encode to www.ol%E9.com
2. Lower case only A-Z characters to www.olÉ.com and then percent encode to www.ol%C9.com
3. Percent encode to www.OL%C9.com, and then lower case the non-percent encoded parts to www.ol%C9.com, producing the same result as 2.
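To make the three candidates concrete, here is a minimal Python sketch of them. The helper names (lower_ascii, lower_outside_escapes, candidates) are made up for illustration. It assumes Latin-1 as the byte encoding only because that reproduces the %E9 and %C9 values above; with UTF-8 the same steps give %C3%A9 and %C3%89 instead.

```python
import re
from urllib.parse import quote


def lower_ascii(text):
    """Lower-case only A-Z by moving each letter 32 code points ahead."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def lower_outside_escapes(text):
    """Lower-case A-Z everywhere except inside %XX triplets."""
    return re.sub(
        r"%[0-9A-Fa-f]{2}|[A-Z]+",
        lambda m: m.group() if m.group().startswith("%") else m.group().lower(),
        text,
    )


def candidates(host, encoding="latin-1"):
    """Return the three candidate normalisations, in the order listed above."""
    form1 = quote(host.lower(), encoding=encoding)         # full Unicode lower-casing first
    form2 = quote(lower_ascii(host), encoding=encoding)    # A-Z only, then encode
    form3 = lower_outside_escapes(quote(host, encoding=encoding))  # encode, then A-Z only
    return form1, form2, form3


print(candidates("www.OLÉ.com"))
# ('www.ol%E9.com', 'www.ol%C9.com', 'www.ol%C9.com')
```

Forms 2 and 3 always agree here because neither touches anything outside A-Z; only form 1 depends on a Unicode definition of case.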
So the question is: Which is correct? If it's case 1., what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?
Hostnames resolved by DNS are effectively always lowercase, since DNS compares ASCII letters case-insensitively.
It is not possible to have non-ASCII characters in DNS hostnames (RFC 1123). However, a workaround has been put in place with "internationalized domain names" (IDN). The encoding used by this workaround is commonly known as punycode.
Punycode enables non-ASCII characters to be represented by ASCII characters.
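As a rough illustration of that representation, Python ships a punycode codec that works on a single label. The sketch below adds the "xn--" prefix that marks an encoded label; the helper names are hypothetical, and no case mapping or validation is done here.

```python
ACE_PREFIX = "xn--"  # prefix that marks a punycode-encoded label


def label_to_ace(label):
    """Encode one label; pure-ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def ace_to_label(label):
    """Reverse of label_to_ace for a single label."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


print(label_to_ace("olé"))         # xn--ol-cja
print(ace_to_label("xn--ol-cja"))  # olé
```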
As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com. If you are getting percentage signs in your domain name, it means that you have URL-encoded the hostname, and that is not correct, at least not for resolving.
For example, an a tag that points at this domain will work correctly. However, the DNS server will not resolve www.ol%C3%A9.com, but rather the domain name converted to punycode: www.olé.com translates to www.xn--ol-cja.com.
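A minimal sketch of that conversion, going from a URL-encoded hostname to the name that is actually looked up. It assumes the percent-encoded bytes are UTF-8 and uses Python's built-in idna codec, which implements the older IDNA 2003 rules (RFC 3490), so it is more lenient than the IDNA 2008 RFCs. The function name dns_name is made up for this example.

```python
from urllib.parse import unquote


def dns_name(host):
    """Percent-decode a hostname, then convert it to its ASCII (punycode) form."""
    unicode_host = unquote(host)  # decodes %XX triplets as UTF-8 by default
    # The idna codec lower-cases non-ASCII labels via nameprep before encoding.
    return unicode_host.encode("idna").decode("ascii")


print(dns_name("www.ol%C3%A9.com"))  # www.xn--ol-cja.com
print(dns_name("www.olÉ.com"))       # www.xn--ol-cja.com
```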
Web browsers will generally convert uppercase characters to the lowercase version. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com was lowercased to www.olé.com.
I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:
Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.
The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFCs that are worth a look:
https://www.rfc-editor.org/rfc/rfc5894
https://www.rfc-editor.org/rfc/rfc5892
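One of the rules in RFC 5892 can be sketched in a few lines. As I read section 2.2, a code point counts as "unstable" when applying NFKC, then case folding, then NFKC again changes it. The sketch below checks only that one rule (the function name is_unstable is made up) and ignores the exceptions and other categories the RFC defines, so treat it as an approximation of why a stricter tool rejects É.

```python
import unicodedata


def is_unstable(ch):
    """True if NFKC -> case fold -> NFKC does not give back the same character."""
    nfkc = unicodedata.normalize("NFKC", ch)
    folded = nfkc.casefold()
    return unicodedata.normalize("NFKC", folded) != ch


print(is_unstable("É"))  # True: it folds to é, so it is not stable
print(is_unstable("é"))  # False
```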
RFC 5894 section 3.1.3 specifies that characters may not be allowed if: