Given a primary domain, I am attempting to extract it and its sub-domains from within a string.
For example, for the primary domain example.co, I want to:
- extract only the primary domain and its sub-domains: example.co, www.example.co, uat.smile.example.co
- not pick up names that extend to the right, so not www.example.com or www.example.co.nz
- treat any space or punctuation character that is not legal in an FQDN as a delimiter
Currently I am getting unwanted items from example.com and example.co.nz. Also, the match for test-me.www.example.co includes the trailing space.
>>> import re
>>> domain = r'example\.co'
>>> line = 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'
>>> re.findall(r"[^\s',]*{}[\s',]*".format(domain), line)
['example.co', 'example.co', 'www.example.co', 'test-me.www.example.co ']
Should I be using regular expressions? If so, guidance on working through this would be much appreciated.
Otherwise is there a better tool for the job?
Edit: I verified Marc Lambrichs' answer, but it fails for the case illustrated below:
import re
pattern = r"((?:[a-zA-Z][\w-]+\.)+{}(?!\w))"
domain = 'google.com'
line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
results = re.findall(pattern.format(re.escape(domain)), line)
print(results)
[]
Also, I would like to pass a string like 'google.com' instead of 'google\.com' and escape it with re.escape(domain), but the code returns an empty list either way.
You can use a regex for this without any splitting whatsoever.
gives as result:
explanation of regex
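As a minimal sketch of what such a pattern could look like, the block below builds one verbose regex and annotates each part. The helper name find_domains and the exact character classes are assumptions made for illustration, not necessarily the pattern this answer originally used.

```python
import re


def find_domains(domain, text):
    """Return the primary domain and its sub-domains found in text.

    Hypothetical helper: the name and the character classes are
    illustrative assumptions, not taken from the original answer.
    """
    pattern = re.compile(
        r"""
        (?<![\w.-])            # not preceded by a character that could belong to a longer name
        (?:[A-Za-z0-9-]+\.)*   # zero or more sub-domain labels, each followed by a dot
        {domain}               # the primary domain, already escaped
        (?![\w-])              # not followed by more label characters (rejects example.com)
        (?!\.[\w-])            # not followed by a dot plus another label (rejects example.co.nz)
        """.format(domain=re.escape(domain)),
        re.VERBOSE,
    )
    return pattern.findall(text)


line = 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'
print(find_domains('example.co', line))
# ['www.example.co', 'test-me.www.example.co']

line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
print(find_domains('google.com', line))
# ['google.com', 'alt1.aspmx.l.google.com']
```

Two differences from the pattern in the question's edit explain the empty list there. That pattern uses (...)+, so it requires at least one sub-domain label and cannot match a bare google.com. It also uses [a-zA-Z][\w-]+, so every label needs at least two characters, which rules out the single-character label l in alt1.aspmx.l.google.com. The sketch uses * for the label group and allows one-character labels. Passing the plain string 'google.com' works because re.escape is applied inside the helper.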