Parsing several FQDNs from string


Given a primary domain, I am trying to extract it and its sub-domains from a string.
For example, for the primary domain example.co I want to:

  • extract only the primary domain and its sub-domains: example.co, www.example.co, uat.smile.example.co
  • not pick up names that extend to the right: no www.example.com or www.example.co.nz
  • treat any whitespace or punctuation character that is not legal in an FQDN as a delimiter

Currently I am getting unwanted items from:
example.com
example.co.nz
Also, the match for test-me.www.example.co includes the trailing space.

>>> import re
>>> domain = r'example\.co'
>>> line = 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'
>>> re.findall(r"[^\s',]*{}[\s',]*".format(domain), line)
['example.co', 'example.co', 'www.example.co', 'test-me.www.example.co ']

Should I be using regular expressions? If so, guidance on working through this would be much appreciated.
Otherwise, is there a better tool for the job?

Edit - Verified Marc Lambrichs' answer but it fails for the case illustrated below:

import re

pattern = r"((?:[a-zA-Z][\w-]+\.)+{}(?!\w))"
domain = 'google.com'
line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
results = re.findall(pattern.format(re.escape(domain)), line)
print(results)  # prints []

Also, I would like to pass a plain string like 'google.com' instead of a pre-escaped 'google\.com' and escape it with re.escape, but the code returns an empty list either way.

Answer by Marc Lambrichs (best answer):

You can use a regex for this without any splitting whatsoever.

$ cat test.py
import re

# Each pair is (primary domain, line of text to search).
tests = [
    ('example.co', 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'),
    ('google.com', 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'),
]

# {} is filled in with the escaped primary domain.
pattern = r"((?:[a-zA-Z][-\w]*\.)*{}(?!\w))"

for domain, line in tests:
    # re.escape makes the dots in the domain match literally.
    results = re.findall(pattern.format(re.escape(domain)), line)
    print(results)

This gives:

$ python test.py
['example.co', 'www.example.co', 'test-me.www.example.co']
['google.com', 'alt1.aspmx.l.google.com']
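
The empty list reported in the question's edit appears to come from that pattern's quantifiers rather than from re.escape. The sketch below runs both patterns on the same line; the names strict_plus and relaxed_star are made up for this illustration.

```python
import re

line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
domain = re.escape('google.com')

# Pattern from the question's edit: the '+' inside the group demands at
# least two characters per label, and the '+' after the group demands at
# least one sub-domain label before the domain.
strict_plus = r"((?:[a-zA-Z][\w-]+\.)+{}(?!\w))".format(domain)

# Pattern from this answer: both quantifiers are '*', so a one-letter
# label such as 'l' and the bare domain are both accepted.
relaxed_star = r"((?:[a-zA-Z][-\w]*\.)*{}(?!\w))".format(domain)

print(re.findall(strict_plus, line))
print(re.findall(relaxed_star, line))
```

The first call prints [] because the bare google.com has no sub-domain label and the label l in alt1.aspmx.l.google.com is a single character. The second prints ['google.com', 'alt1.aspmx.l.google.com'].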

Explanation of the regex:

(                  # group 1 start
  (?:              # non-capturing group
     [a-zA-Z]      # RFC 1034: a label starts with a letter
     [-\w]*\.      # 0 or more word chars or '-', followed by '.'
  )*               # repeat the non-capturing group 0 or more times
  example\.co      # match the (escaped) domain
  (?!\w)           # negative lookahead: no word char may follow
)                  # group 1 end
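
One caveat: (?!\w) only rejects a following word character, so the 'example.co' entry in the example.co results is in fact extracted from example.co.nz. Nothing constrains the left edge either, so a name like notexample.co would also yield example.co. If that matters, one possible tightening is sketched below. The helper name find_fqdns and the exact boundary classes are assumptions of this sketch, not part of the pattern above.

```python
import re

def find_fqdns(domain, text):
    """Return `domain` and its sub-domains found in `text` (illustrative helper)."""
    left = r"(?<![\w.-])"               # not preceded by a name character
    labels = r"(?:[a-zA-Z][-\w]*\.)*"   # zero or more sub-domain labels
    right = r"(?![\w-]|\.[\w-])"        # not followed by more of a name; a
                                        # bare trailing '.' is still allowed
    pattern = left + "(" + labels + re.escape(domain) + ")" + right
    return re.findall(pattern, text)

line = ('example.com example.co.nz www.example.co. '
        'test-me.www.example.co bad.example-co.co notexample.co')
print(find_fqdns('example.co', line))
print(find_fqdns('google.com',
                 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'))
```

With these boundaries the first call prints ['www.example.co', 'test-me.www.example.co'] and the second prints ['google.com', 'alt1.aspmx.l.google.com']. The lookahead accepts a trailing dot only when no further label follows it, which is what separates www.example.co. from example.co.nz.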