I try to find emails into html using regex but I have problems with some websites.
The main problem is that regex function paralyzes the process and leaves the cpu overloaded.
import re
from urllib.request import urlopen, Request
email_regex = re.compile('([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)
request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = str(urlopen(request, timeout=5).read().decode("utf-8", "strict"))
email_regex.findall(html) ## here is where regex takes a long time
I have not problems if the website is another one.
request = Request('https://www.velezmalaga.es/')
If someone know how to solve this problem or know how to timeout the regex function, I will appreciate it.
I use Windows.
I initially tried fiddling with your approach, but then I ditched it and resorted to
BeautifulSoup
. It worked.Try this:
Output:
EDIT:
One reason your approach doesn't work is, well, the regex itself. The other one is the size (I suspect) of the HTML returned.
See this:
This prints: