I need a function:
f(fqdn,suffix) -> basedomain
with these example inputs and outputs:
f('foobar.quux.somedomain.com','com') -> 'somedomain.com'
f('somedomain.com','com') -> 'somedomain.com'
f('foobar.quux.somedomain.com.br','com.br') -> 'somedomain.com.br'
f('somedomain.com.br','com.br') -> 'somedomain.com.br'
In plain English, if the suffix
has n
segments, take the last n+1
segments. Find the base domain for the FQDN, allowing for the fact that some FQDNs have more than one suffix element.
The suffixes I need to match are here. I've already got them in my SQL database.
I could write this in C#; it might not be the most elegant but it would work. Unfortunately I would like to have this function in either T-SQL, where it is closest to the data, or in Powershell, which is where the rest of the utility that consumes this data is going to be. I suppose it would be ok to do it in C#, compile to an assembly and then access it from T-SQL, or even from Powershell ... if that would be the fastest executing. If there's some reasonably clever alternative in pure T-SQL or simple Powershell, I'd like that.
EDIT: One thing I forgot to mention explicitly (but which is clear when reviewing the suffix list, at my link above) is that we must pick the longest matching suffix. Both "br" and "com.br" appear in the suffix list (with similar things happening for uk, pt, etc). So the SQL has to use a window function to make sure the longest matching suffix is found.
Here is how far I got when I was doing the SQL. I had gotten lost in all the substring
/reverse
functions.
SELECT Domain, suffix
FROM (
SELECT SD.Domain, SL.suffix,
RN=ROW_NUMBER() OVER (
PARTITION BY sd.Domain ORDER BY LEN(SL.suffix) DESC)
FROM SiteDomains SD
INNER JOIN suffixlist SL ON SD.Domain LIKE '%.'+SL.suffix
) AS X
WHERE RN=1
This works ok for finding the right suffix. I'm a little concerned about its performance though.
The following demonstrates matching FQDNs with TLDs and extracting the desired n + 1 domain name segments: