Need to go from hostname to base domain

380 views Asked by At

I need a function:

f(fqdn,suffix) -> basedomain

with these example inputs and outputs:

f('foobar.quux.somedomain.com','com') -> 'somedomain.com'
f('somedomain.com','com') -> 'somedomain.com'
f('foobar.quux.somedomain.com.br','com.br') -> 'somedomain.com.br'
f('somedomain.com.br','com.br') -> 'somedomain.com.br'

In plain English, if the suffix has n segments, take the last n+1 segments. Find the base domain for the FQDN, allowing for the fact that some FQDNs have more than one suffix element.

The suffixes I need to match are here. I've already got them in my SQL database.

I could write this in C#; it might not be the most elegant but it would work. Unfortunately I would like to have this function in either T-SQL, where it is closest to the data, or in Powershell, which is where the rest of the utility that consumes this data is going to be. I suppose it would be ok to do it in C#, compile to an assembly and then access it from T-SQL, or even from Powershell ... if that would be the fastest executing. If there's some reasonably clever alternative in pure T-SQL or simple Powershell, I'd like that.

EDIT: One thing I forgot to mention explicitly (but which is clear when reviewing the suffix list, at my link above) is that we must pick the longest matching suffix. Both "br" and "com.br" appear in the suffix list (with similar things happening for uk, pt, etc). So the SQL has to use a window function to make sure the longest matching suffix is found.

Here is how far I got when I was doing the SQL. I had gotten lost in all the substring/reverse functions.

SELECT Domain, suffix
FROM (
    SELECT SD.Domain, SL.suffix, 
       RN=ROW_NUMBER() OVER (
           PARTITION BY sd.Domain ORDER BY LEN(SL.suffix) DESC)
    FROM SiteDomains SD
    INNER JOIN suffixlist SL ON SD.Domain LIKE '%.'+SL.suffix
) AS X
WHERE RN=1

This works ok for finding the right suffix. I'm a little concerned about its performance though.

3

There are 3 answers

3
HABO On BEST ANSWER

The following demonstrates matching FQDNs with TLDs and extracting the desired n + 1 domain name segments:

-- Sample data.
declare @SampleTLDs as Table ( TLD VarChar(64) );
insert into @SampleTLDs ( TLD ) values
  ( 'com' ), ( 'somedomain.com' ), ( 'com.br' );
declare @SampleFQDNs as Table ( FQDN VarChar(64) );
insert into @SampleFQDNs ( FQDN ) values
  ( 'foobar.quux.somedomain.com' ), ( 'somedomain.com' ),
  ( 'foobar.quux.somedomain.com.br' ), ( 'somedomain.com.br' );
select * from @SampleTLDs;
select * from @SampleFQDNs;

-- Fiddle about.
select FQDN, TLD,
  case
    when DotPosition = 0 then FQDN
    else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
    end as Result
  from (
    select FQDNs.FQDN, TLDs.TLD,
      Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
      CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
      from @SampleFQDNs as FQDNs inner join
        @SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD  ) as Edna;

-- To select only the longest matching TLD for each FQDN:
with
  ExtendedFQDNs as (
    select FQDNs.FQDN, TLDs.TLD, Row_Number() over ( partition by FQDN order by Len( TLDs.TLD ) desc ) as TLDLenRank,
      Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
      CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
      from @SampleFQDNs as FQDNs inner join
        @SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD )
  select FQDN, TLD,
    case
      when DotPosition = 0 then FQDN
      else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
      end as Result
    from ExtendedFQDNs
    where TLDLenRank = 1;
0
S3S On

Here is a tsql variant...

declare @fqdn varchar(256) = 'somedomain.com'
declare @suffix varchar(128) = 'com'

select left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2)

if(select CHARINDEX('.',reverse(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2)))) = 0
    begin
    select left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2) + '.' + @suffix
    end
else
    begin
    select right(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2),CHARINDEX('.',reverse(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2))) - 1) + '.' + @suffix
    end
1
Ross Presser On

Here's how I would do it in C#:

string getBaseDomain(string fqdn, string suffix)
{
    string[] domainSegs = fqdn.Split('.');
    return domainSegs[domainSegs.Length - suffix.Split('.').Length - 1] + "." + suffix;
}

So here it is in Powershell:

function getBaseDomain
{
  Param(
    [string]$fqdn, 
    [string]$suffix 
  )
  $domainSegs = $fqdn.Split(".");
  return $domainSegs[$domainSegs.Length - $suffix.Split(".").Length - 1] + "."+$suffix;
}

Seems rather silly now to have wasted stackoverflow.com's time with this. My apologies.