Given a URL, I have to find the hostname by using a regex.
The URLs can be of varied forms:
http://www.google.com/ [expected 'www.google.com']
https://www.google.com:2000/ [expected 'www.google.com']
http://100.1.25.3:8000/foo/bar?abc.php=xxxx+xxxx [expected '100.1.25.3']
www.google.com [expected 'www.google.com']
10.0.2.2:5000 [expected '10.0.2.2']
localhost/ [expected 'localhost']
localhost/foo [expected 'localhost']
The closest I could come up with is:
^(?:[^:]+://)*([^:/]+).*
and use the string captured by the first capturing group of the regular expression.
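As a minimal sketch of how that pattern is applied, the snippet below runs it with Python's re module and reads the first capturing group. The helper name capture_host is hypothetical and only used for illustration; the pattern itself is unchanged.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT_RE = re.compile(r'^(?:[^:]+://)*([^:/]+).*')

def capture_host(url):
    """Return the first capturing group, or None if the pattern does not match.

    capture_host is a hypothetical helper name, not part of any library.
    """
    match = ATTEMPT_RE.match(url)
    return match.group(1) if match else None

for url in ['https://www.google.com:2000/', '10.0.2.2:5000', 'localhost/foo']:
    print(url, '->', capture_host(url))
# https://www.google.com:2000/ -> www.google.com
# 10.0.2.2:5000 -> 10.0.2.2
# localhost/foo -> localhost
```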
However, a few cases fail:
google.com [nothing is captured, expected 'google.com']
http://///x ['http' is captured, expected nothing]
What would be a regex that can cope with these cases?
Please note that:
- I'm not asking what is wrong with my regex. I know where things are wrong, I just can't come up with another regex.
- Solutions only need to reliably extract the hostname and need not validate it. I validate it later, so if the regex extracts google!com from https://google!com/foo, this is acceptable*.
* ... and probably even desirable, since hostnames can contain Unicode characters (Internationalized Domain Names).
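One possible pattern that meets these requirements is sketched below in Python. It rests on an assumption not stated in the question: a scheme is any run of characters other than ':' and '/' followed by '://', and it is consumed only when a host character follows it. The names EXTRACT_RE and extract_host are hypothetical. Userinfo (user@host) and bracketed IPv6 literals are not handled.

```python
import re

# Assumption: a scheme is [^:/]+ followed by '://'.
EXTRACT_RE = re.compile(
    r'^(?:[^:/]+://(?=[^:/])'  # a scheme, but only if a host character follows
    r'|(?![^:/]+://))'         # otherwise require that no scheme is present
    r'([^:/]+)'                # host: everything up to the next ':' or '/'
)

def extract_host(url):
    """Return the extracted host without validating it, or None if there is none."""
    match = EXTRACT_RE.match(url)
    return match.group(1) if match else None

for url in ['https://www.google.com:2000/', 'google.com',
            'https://google!com/foo', 'http://///x']:
    print(url, '->', extract_host(url))
# https://www.google.com:2000/ -> www.google.com
# google.com -> google.com
# https://google!com/foo -> google!com
# http://///x -> None
```

The two alternatives at the start are what stop the engine from backtracking and capturing 'http' as a host: either a scheme is consumed and followed by a host character, or the input must not begin with a scheme at all.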
I came up with this:
^ - Anchors the match at the start of the string
(?:[a-zA-Z\d][a-zA-Z\d-]+){1} - Matches the hostname
(?:\.[a-zA-Z]{2,6})+ - Matches one or more TLDs (e.g. co.uk)
$ - Anchors the match at the end of the string
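To see how these pieces behave once joined, here is a small Python sketch. The helper name looks_like_hostname is hypothetical. Because the pattern is anchored at both ends, it appears to act as a check on a bare dotted hostname and does not pull a host out of a longer URL, so inputs with a scheme, port, or path do not match.

```python
import re

# The pieces from the answer, joined in order.
HOSTNAME_RE = re.compile(r'^(?:[a-zA-Z\d][a-zA-Z\d-]+){1}(?:\.[a-zA-Z]{2,6})+$')

def looks_like_hostname(text):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return HOSTNAME_RE.match(text) is not None

for text in ['google.com', 'bbc.co.uk', 'localhost', '10.0.2.2',
             'https://www.google.com:2000/']:
    print(text, '->', looks_like_hostname(text))
# google.com -> True
# bbc.co.uk -> True
# localhost -> False   (at least one .tld group is required)
# 10.0.2.2 -> False    (the .tld groups accept letters only)
# https://www.google.com:2000/ -> False
```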