Python script times out or will not finish running


I've been working on a Python script that scrapes certain webpages.

The beginning of the script looks like this:

# -*- coding: UTF-8 -*-
import urllib2
import re

database = ''

contents = open('contents.html', 'r')

for line in contents:
    entry = ''
    # Pull the page name out of each link, without the .htm extension
    f = re.search(r'(?<=a href=")(.+?)(?=\.htm)', line)
    if f:
        entry = f.group(0)

        page = urllib2.urlopen('https://indo-european.info/pokorny-etymological-dictionary/' + entry + '.htm').read()

        # Raw strings keep regex escapes such as \s intact
        m = re.search(r'English meaning(&#160;)+\s+(.+?)</font>', page)
        if m:
            title = m.group(2)
        else:
            title = 'N/A'

This accesses each page and grabs a title from it. Then I have a number of code blocks that test whether certain text is present in each page. Here is one example:

        # r'' so that \b means a word boundary, not a backspace character
        abg = re.findall(r'\babg\b', page)
        if len(abg) == 0:
            abg = 'N'
        else:
            abg = 'Y'
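
A check like this depends on how the pattern string is written. In a normal Python string literal, '\b' is the backspace character, so the pattern only means "word boundary" when it is written as a raw string. A minimal sketch of the difference, using a made-up page string rather than a real scraped page:

```python
import re

page = 'see abg for details'  # hypothetical page text, not from the real site

# Without the r prefix, '\b' is the backspace character (\x08),
# so this looks for backspace + 'abg' + backspace and finds nothing.
plain = re.findall('\babg\b', page)

# With the r prefix, \b reaches the regex engine as a word boundary.
raw = re.findall(r'\babg\b', page)

print(plain)  # []
print(raw)    # ['abg']
```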

Then, finally, still in the for loop, I add this information to the variable database:

        database += '\n' + '<F>' + str(entry) + '<TITLE="' + str(title) + '"><FQ="N"><SQ="N"><ABG="' + str(abg) + '"></F>'

Note that I have used str() for each variable because I was getting a "can't concatenate strings and lists" error for some reason.

Once the for loop is completed, I write the database variable to a file:

f = open('database.txt', 'wb')
f.write(database)
f.close()

When I run this in the command line, it times out or never completes running. Any ideas as to what might be causing the issue?

EDIT: I fixed it. The program seems to have been slowed down by accumulating the result of every loop iteration in the database variable. All I had to do to fix the issue was move the write into the for loop, so that each entry is written to the file as soon as it is produced.
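
For anyone hitting the same thing, here is a rough sketch of what writing inside the loop can look like. It is not the exact script: fetch_page and write_records are hypothetical names, fetch_page returns canned text in place of the real urllib2 call, and the record format is shortened.

```python
import re


def fetch_page(entry):
    # Hypothetical stand-in for urllib2.urlopen(...).read(); no network access
    return '<p>English meaning: %s</p>' % entry


def write_records(entries, out_path):
    """Write one record per entry as soon as it is built; return the count."""
    count = 0
    with open(out_path, 'w') as out:
        for entry in entries:
            page = fetch_page(entry)
            abg = 'Y' if re.search(r'\babg\b', page) else 'N'
            # Written immediately, so nothing accumulates in memory
            out.write('<F>' + entry + '<ABG="' + abg + '"></F>\n')
            count += 1
    return count
```

Calling write_records(['abg', 'bher'], 'database.txt') writes two lines to the file. A side benefit of this layout is that if the run is interrupted partway, the entries processed so far are already on disk.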
