Regex for extracting hostname


Given a URL, I have to find the hostname by using a regex.

The URLs can be of varied forms:

http://www.google.com/                            [expected 'google.com']
https://www.google.com:2000/                      [expected 'www.google.com']
http://100.1.25.3:8000/foo/bar?abc.php=xxxx+xxxx  [expected '100.1.25.3']
www.google.com                                    [expected 'www.google.com']
10.0.2.2:5000                                     [expected '10.0.2.2']
localhost/                                        [expected 'localhost']
localhost/foo                                     [expected 'localhost']

The closest I could come up with is:

^(?:[^:]+://)*([^:/]+).*

and use the string captured by the first capturing group of the regular expression.

However, a few cases fail:

google.com   [nothing is captured, expected 'google.com']
http://///x  ['http' is captured, expected nothing]
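
The second failure comes from backtracking: when no hostname character follows the scheme, the engine abandons the scheme group and lets the capture group take `http` instead. A minimal Python sketch reproduces this, assuming Python's `re` module behaves like the engine in use for this pattern (the helper name `first_group` is made up for illustration):

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'^(?:[^:]+://)*([^:/]+).*')

def first_group(url):
    """Return capture group 1, or None if the pattern does not match."""
    m = PATTERN.match(url)
    return m.group(1) if m else None

print(first_group('https://www.google.com:2000/'))  # www.google.com
print(first_group('10.0.2.2:5000'))                  # 10.0.2.2
print(first_group('localhost/foo'))                  # localhost

# The scheme group first matches 'http://', but [^:/]+ cannot match the
# following '/', so the engine backtracks, skips the scheme group, and
# captures 'http' from the start of the string.
print(first_group('http://///x'))                    # http
```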

What would be a regex that can cope with these cases?


Please note that:

  • I'm not asking what is wrong with my regex. I know where things are wrong, I just can't come up with another regex.
  • Solutions only need to reliably extract the hostname; they need not validate it. I validate the result later, so if the regex extracts google!com from https://google!com/foo, that is acceptable*.

* ... and probably even desirable, since hostnames can contain Unicode characters (Internationalized Domain Names).

There are 2 answers

1
Richard Hamilton

I came up with this:

/^(?:[a-zA-Z\d][a-zA-Z\d-]+){1}(?:\.[a-zA-Z]{2,6})+$/

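^ - Anchors the match at the start of the string

(?:[a-zA-Z\d][a-zA-Z\d-]+){1} - Matches the first label of the hostname: a letter or digit followed by one or more letters, digits, or hyphens (the {1} is redundant)

(?:\.[a-zA-Z]{2,6})+ - Matches one or more dot-separated suffixes of 2 to 6 letters each, such as .com or .co.uk

$ - Anchors the match at the end of the string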
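
As a rough check, here is how the pattern behaves in Python (the sample inputs are taken from the question; `is_domain` is just an illustrative name). Because of the `^` and `$` anchors, it tests whether the whole string is a bare domain name rather than pulling a hostname out of a URL, so inputs with a scheme, a port, an IP address, or no dot do not match:

```python
import re

# Same pattern as above, without the /.../ delimiters.
DOMAIN = re.compile(r'^(?:[a-zA-Z\d][a-zA-Z\d-]+){1}(?:\.[a-zA-Z]{2,6})+$')

def is_domain(text):
    """True if the whole string matches the pattern."""
    return DOMAIN.match(text) is not None

print(is_domain('google.com'))              # True
print(is_domain('google.co.uk'))            # True
print(is_domain('www.google.com'))          # True
print(is_domain('localhost'))               # False: no dot-separated suffix
print(is_domain('10.0.2.2'))                # False: suffixes must be letters
print(is_domain('http://www.google.com/'))  # False: ':' and '/' not allowed
```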

0
anubhava

You can use this regex in PCRE:

'~^(?:[^:\n]+://)?([^:#/\n]*)~m'
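
A sketch of the same pattern in Python, assuming its `re` module treats this expression the same way PCRE does (the `~...~m` delimiters and flag become `re.MULTILINE`; `extract_host` is a hypothetical helper name). The hostname is in capture group 1, which is an empty string when there is nothing to extract:

```python
import re

# Optional scheme, then everything up to the first ':', '#', '/' or newline.
HOST = re.compile(r'^(?:[^:\n]+://)?([^:#/\n]*)', re.MULTILINE)

def extract_host(url):
    """Return capture group 1 for the first line of `url`."""
    # The capture group may match the empty string, so the pattern
    # always matches at the start of the input.
    return HOST.match(url).group(1)

for url in [
    'https://www.google.com:2000/',
    'http://100.1.25.3:8000/foo/bar?abc.php=xxxx+xxxx',
    'www.google.com',
    '10.0.2.2:5000',
    'localhost/foo',
    'google.com',
    'http://///x',
]:
    print(repr(url), '->', repr(extract_host(url)))
```

Since the pattern does no validation, the later validation step mentioned in the question would still need to reject an empty result such as the one for `http://///x`.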
