Timeout while using regex Python3


I am trying to extract email addresses from HTML using a regex, but I run into problems on some websites.

The main problem is that the regex call hangs the process and leaves the CPU maxed out.

import re
from urllib.request import urlopen, Request

email_regex = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = urlopen(request, timeout=5).read().decode("utf-8", "strict")

email_regex.findall(html)  # this is where the regex takes a long time

I have no problems if the website is a different one, for example:

request = Request('https://www.velezmalaga.es/')

If someone knows how to solve this problem, or how to put a timeout on the regex call, I would appreciate it.

I use Windows.
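
A minimal sketch of one way to time out the regex call itself: run the search in a worker process and abandon it if it does not finish. This assumes only the standard multiprocessing module; on Windows the if __name__ == '__main__' guard is required, and the html string below is just a placeholder for the page fetched above.

import re
from multiprocessing import Pool, TimeoutError

email_regex = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

def find_emails(html):
    return email_regex.findall(html)

if __name__ == '__main__':
    html = "... HTML fetched with urlopen, as in the question ..."  # placeholder
    with Pool(processes=1) as pool:
        result = pool.apply_async(find_emails, (html,))
        try:
            emails = result.get(timeout=5)   # wait at most 5 seconds
        except TimeoutError:
            emails = []                      # give up on pathological pages
    # leaving the with-block terminates the pool, killing a stuck worker
    print(emails)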


There are 2 answers

baduker

I initially tried fiddling with your approach, but then I ditched it and resorted to BeautifulSoup. It worked.

Try this:

import re
import requests

from bs4 import BeautifulSoup


headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}

pages = ['http://www.serviciositvyecla.com', 'https://www.velezmalaga.es/']

emails_found = set()
for page in pages:
    html = requests.get(page, headers=headers).content
    soup = BeautifulSoup(html, "html.parser").select('a[href^=mailto]')
    for item in soup:
        try:
            emails_found.add(item['href'].split(":")[-1].strip())
        except ValueError:
            print("No email :(")

print('\n'.join(email for email in emails_found))

Output:

[email protected]
[email protected]

EDIT:

One reason your approach doesn't work is, well, the regex itself. The other one is the size (I suspect) of the HTML returned.

See this:

import re
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}

html = requests.get('https://www.velezmalaga.es/', headers=headers).text

op_regx = r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})'
simplified_regex = r'[\w\.-]+@[\w\.-]+\.\w+'

print(f"OP's regex results: {re.findall(op_regx, html)}")
print(f"Simplified regex results: {re.findall(simplified_regex, html)}")

This prints:

OP's regex results: []
Simplified regex results: ['[email protected]', '[email protected]']

Joakin Montesinos

Finally, I found a solution that stops the regex search from eating all the RAM. In my case, getting an empty result even though there is an email somewhere on the page is acceptable, as long as the process is not blocked by running out of memory. The HTML of the scraped page contained 5.5 million characters, and 5.1 million of them carried no useful information: they belong to a hidden div full of unintelligible characters. So I added a guard along the lines of if len(html) < 1000000: do whatever, and only run the regex when it passes.
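
A minimal sketch of that guard, reusing the regex and request from the question; the 1,000,000-character threshold is the one mentioned above and may need tuning for other pages:

import re
from urllib.request import urlopen, Request

email_regex = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0')  # any browser-like UA, as in the question
html = urlopen(request, timeout=5).read().decode("utf-8", "strict")

# Only run the regex on reasonably sized pages; on this site most of the
# HTML is a huge hidden div, so skipping it avoids the CPU/memory blow-up.
if len(html) < 1000000:
    emails = email_regex.findall(html)
else:
    emails = []  # accept an empty result rather than hanging the process

print(emails)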