Display IDNs from normalized URIs with Ruby (using the Addressable gem)

In my Ruby app I need to handle URIs from user input (which are actually IRIs):

str = "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"

I normalize these using Addressable, and only store the normalized form:

normalized = Addressable::URI.parse(str).normalize
normalized.to_s
#=> http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

This is nice to work with, but obviously not nice to display to end users.

For that, I'd like to convert this URI back to its original form (no Punycode host, no percent-encoded path).

Addressable has display_uri, but that only converts the host:

nicer = normalized.display_uri.to_s
#=> http://उदाहरण.परीक्षा/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

This looks like it works:

display_s = Addressable::URI.parse(str).display_uri.to_s
pretty = Addressable::URI.unencode(display_s.force_encoding("ASCII-8BIT"))

However, that code looks wrong (I should not need to use force_encoding) and I'm not at all confident that it is correct.

  • What is a good, sane way to convert the entire URI to something usable for end users ("http://उदाहरण.परीक्षा/मुख्य_पृष्ठ")?

  • Is storing the URIs in normalized form even a good idea, or does that have consequences I might not be aware of?

code: https://gist.github.com/levinalex/6115764

tl;dr

How do I convert this:

"http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/" +
"%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4" +
"%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"

to this:

"http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"

Answer by i-blis (best answer)

You should not need any forced (re-)encoding to recover the original URI. Simply:

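require "addressable/uri"
normalised_s = "http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
# display_uri decodes the Punycode host; unencode then decodes the percent-escapes.
Addressable::URI.unencode(Addressable::URI.parse(normalised_s).display_uri.to_s)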

=> "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
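
To illustrate why no encoding hack should be needed on the caller's side: percent-decoding produces raw bytes, and the finished string only has to be labelled as UTF-8 once every escape has been replaced. Here is a minimal sketch using only the standard library. percent_decode is a hypothetical helper written for illustration, not Addressable's implementation, and it assumes every escape in the input encodes UTF-8 text.

```ruby
# Hypothetical helper for illustration only; not part of Addressable.
# Assumes every percent-escape in the input encodes UTF-8 bytes.
def percent_decode(str)
  bytes = str.b                                  # binary copy, so raw bytes can be mixed in safely
  bytes = bytes.gsub(/%(\h\h)/) { $1.hex.chr }   # replace each %XX with the byte it names
  bytes.force_encoding(Encoding::UTF_8)          # relabel the finished byte string as UTF-8
end

encoded_path = "/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_" \
               "%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
decoded_path = percent_decode(encoded_path)

puts decoded_path
puts decoded_path.valid_encoding?
```

Keep in mind that a fully decoded string is for display only: escapes such as %2F or %3F decode to "/" and "?", so the pretty form may no longer parse to the same URI as the stored one.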

To repeat what Bob said in the comments, normalisation is definitely a good way of guaranteeing uniqueness for storage.
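
The uniqueness point can be sketched as follows. This example deliberately uses Ruby's standard uri library on ASCII-only input so that it runs without any gem; it is an illustration of the idea, not Addressable's behaviour, whose normalize additionally handles Punycode hosts and percent-encoding as shown in the question. The example URIs are made up.

```ruby
require "uri"
require "set"

# Three spellings of the same ASCII-only URI, as users might type them.
inputs = ["HTTP://Example.COM", "http://example.com/", "http://EXAMPLE.com"]

# Store only the normalized form; the Set collapses the duplicates.
stored = inputs.map { |s| URI.parse(s).normalize.to_s }.to_set

puts stored.size
puts stored.first
```

Because every spelling maps to one canonical string, a lookup or a unique index on the stored value treats them as the same resource.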