SQL SERVER get hostname and path but not parameters

1.5k views Asked by At

I have to get the hostname & pathname from an URL using regular expression.

URL can be in the form of

a) http://test.com/mary/archive/project.aspx --> I want http::/test.com/mary
b) http://www.testmary.com/company --> I want http://www.testmary.com/
c) http://marytest.com/b/about/--> I want http://marytest.com/
d) http://mary.test.com/b/mary/project.aspx --> I want http://mary.test.com/b/mary/

Basically I am looking get records that have url "mary" and get only hostname or pathname depending on where "mary" is.

Any help is appreciated.

Thanks R

1

There are 1 answers

5
Aaron Bertrand On

You don't need CLR and RegEx for this.

DECLARE @x TABLE(id INT, url VARCHAR(2048));

INSERT @x VALUES
(1,'http://test.com/mary/archive/project.aspx'), 
    --> I want http::/test.com/mary
(2,'http://www.testmary.com/company'), 
    --> I want http://www.testmary.com/
(3,'http://marytest.com/b/about/'),
    --> I want http://marytest.com/
(4,'http://mary.test.com/b/mary/project.aspx'),
    --> I want http://mary.test.com/b/mary/
(5,'mary.test.com');
    --> I want mary.test.com

SELECT ID, 
  [output] = SUBSTRING(url, 1, LEN(url) - CHARINDEX('yram', REVERSE(url)) 
    + COALESCE(NULLIF(CHARINDEX('/', SUBSTRING(url,  LEN(url) - CHARINDEX('yram',  
      REVERSE(url)) + 1, 2048)),0),2048))
FROM @x
WHERE url LIKE '%mary%';

Results:

ID  output
--  --------------------------------
1   http://test.com/mary/
2   http://www.testmary.com/
3   http://marytest.com/
4   http://mary.test.com/b/mary/
5   mary.test.com

The only thing I didn't understand is why the first row in the output should be missing a trailing slash, while the other rows include it. In my query this trailing slash is included in all rows. If it shouldn't be included only on that one row, you'll need to explain why.

I do recommend getting a better handle on data cleansing, though. Why would you allow some URLs without the http:// prefix?