Is there a workaround to open URLs containing underscores in Ruby?

25.9k views Asked by At

I'm using open-uri to open URLs.

resp = open("http://sub_domain.domain.com")

If it contains underscore I get an error:

URI::InvalidURIError: the scheme http does not accept registry part: sub_domain.domain.com (or bad hostname?)

I understand that this is because according to RFC URLs can contain only letters and numbers. Is there any workaround?

9

There are 9 answers

1
stef On BEST ANSWER

This looks like a bug in URI, and uri-open, HTTParty and many other gems make use of URI.parse.

Here's a workaround:

require 'net/http'
require 'open-uri'

def hopen(url)
  begin
    open(url)
  rescue URI::InvalidURIError
    host = url.match(".+\:\/\/([^\/]+)")[1]
    path = url.partition(host)[2] || "/"
    Net::HTTP.get host, path
  end
end

resp = hopen("http://dear_raed.blogspot.com/2009_01_01_archive.html")
9
Earlz On

An underscore can not be contained in a domain name like that. That is part of the DNS standard. Did you mean to use a dash(-)?

Even if open-uri didn't throw an error such a command would be pointless. Why? Because there is no way it can resolve such a domain name. At best you'd get an unknown host error. There is no way for you to register a domain name with an _ in it, and even running your own private DNS server, it is against the specification to use a _. You could bend the rules and allow it(by modifying the DNS server software), but then your operating system's DNS resolver won't support it, neither will your router's DNS software.

Solution: Don't try to use a _ in a DNS name. It won't work anywhere and it's against the specifications

0
TomDavies On

I recommend using the Curb gem: https://github.com/taf2/curb which just wraps libcurl. Here is a simple example that will automatically follow redirects and print the response code and response body:

rsp = Curl::Easy.http_get(url){|curl| curl.follow_location = true; curl.max_redirects=10;}
puts rsp.response_code
puts rsp.body_str

I usually avoid the ruby URI classes since they are too strick to the spec which as you know the web is the wild west :) Curl / curb handles every url I throw at it like a champ.

1
Vasfed On

For anyone stumbling upon this:

Ruby's URI.parse used to be based on RFC2396 (published in Aug 1998), see https://bugs.ruby-lang.org/issues/8241

But starting at ruby 2.2 URI is upgraded into RFC 3986, so if you're on a modern version, no monkey patches are necessary now.

1
pguardiario On

URI has an old-fashioned idea of what an url looks like.

Lately I'm using addressable to get around that:

require 'open-uri'
require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI::parse url
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

resp = open("http://sub_domain.domain.com") # Yay!

Don't forget to gem install addressable

2
Larry Kyrala On

Here is a patch that solves the problem for a wide variety of situations (rest-client, open-uri, etc.) without using external gems or overriding parts of URI.parse:

module URI
  DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?")
end

Source: lib/uri/rfc2396_parser.rb#L86

Ruby-core has an open issue: https://bugs.ruby-lang.org/issues/8241

1
cluesque On

This initializer in my rails app seems to make URI.parse work at least:

# config/initializers/uri_underscore.rb
class URI::Generic
  def initialize_with_registry_check(scheme,
                 userinfo, host, port, registry,
                 path, opaque,
                 query,
                 fragment,
                 parser = DEFAULT_PARSER,
                 arg_check = false)
    if %w(http https).include?(scheme) && host.nil? && registry =~ /_/
      initialize_without_registry_check(scheme, userinfo, registry, port, nil, path, opaque, query, fragment, parser, arg_check)
    else
      initialize_without_registry_check(scheme, userinfo, host, port, registry, path, opaque, query, fragment, parser, arg_check)
    end
  end
  alias_method_chain :initialize, :registry_check
end
0
Julian Mann On

I had this same error while trying to use gem update / gem install etc. so I used the IP address instead and its fine now.

0
sheerun On

Here is another ugly hack, no gem needed:

def parse(url = nil)
    begin
        URI.parse(url)
    rescue URI::InvalidURIError
        host = url.match(".+\:\/\/([^\/]+)")[1]
        uri = URI.parse(url.sub(host, 'dummy-host'))
        uri.instance_variable_set('@host', host)
        uri
    end
end