Link Scraping Program Redundancy?


I am attempting to create a small script that takes a given website along with a keyword, follows all the links a certain number of times (only links on the website's domain), and finally searches all of the found links for the keyword, returning any successful matches. Ultimately, its goal is this: if you remember a website where you saw something and know a good keyword that the page contained, this program might be able to help find the link to the lost page. Now my bug: while looping through all of these pages, extracting their URLs, and building a list of them, the script somehow ends up redundantly going over and removing the same links from the list. I did add a safeguard against this, but it doesn't seem to be working as expected. I feel like some URL(s) are mistakenly being duplicated in the list and end up being checked an infinite number of times.

Here's my full code (sorry about the length); the problem area seems to be at the very end, in the for loop:

import bs4, requests, sys

def getDomain(url):
    '''Extracts the bare domain name from a URL, e.g. "google" from
    "http://www.google.com".'''
    if "www" in url:
        domain = url[url.find('.')+1:url.rfind('.')]
    elif "http" in url:
        domain = url[url.find("//")+2:url.rfind('.')]
    else:
        domain = url[:url.rfind(".")]
    return domain

def findHref(html):
    '''Will find the link in a given BeautifulSoup match object.'''
    link_start = html.find('href="')+6
    link_end = html.find('"', link_start)
    return html[link_start:link_end]

def pageExists(url):
    '''Returns True if url (a URL string) returns a 200 response and
    doesn't redirect to a DNS search page.'''
    response = requests.get(url)
    try:
        response.raise_for_status()
        if response.text.find("dnsrsearch") >= 0:
            print response.text.find("dnsrsearch")
            print "Website does not exist"
            return False
    except Exception as e:
        print "Bad response:",e
        return False
    return True

def extractURLs(url):
    '''Returns list of urls in url that belong to same domain.'''
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, "html.parser")  # explicit parser avoids bs4's "no parser specified" warning
    matches = soup.find_all('a')
    urls = []
    for index, link in enumerate(matches):
        match_url = findHref(str(link).lower())
        if "." in match_url:
            if domain not in match_url:
                print "Removing",match_url
            else:
                urls.append(match_url)
        else:
            urls.append(url + match_url)
    return urls

def searchURL(url):
    '''Search url for keyword.'''
    pass

print "Enter homepage:(no http://)"
homepage = "http://" + raw_input("> ")
homepage_response = requests.get(homepage)
if not pageExists(homepage):
    sys.exit()
domain = getDomain(homepage)

print "Enter keyword:"
#keyword = raw_input("> ")
print "Enter maximum branches:"
max_branches = int(raw_input("> "))

links = [homepage]
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
        for result in results:
            if result not in links:
                links.append(result)

Partial output (about .000000000001% of it):

Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard

There are 2 answers

Padraic Cunningham

You are repeatedly looping over the same links with your outer loop:

for n in range(max_branches):
    for link in links:
        results = extractURLs(link)

I would also be careful about appending to a list you are iterating over, or you could well end up with an infinite loop.
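
For illustration, here is a minimal sketch (my own, not part of the original answer) of one way to address both issues at once: keep a set of URLs already seen and build each level's frontier in a separate list, so the list being iterated is never mutated. It reuses extractURLs, homepage, and max_branches from the question; the names seen and frontier are hypothetical.

    seen = set([homepage])   # every URL ever queued, so nothing is rechecked
    frontier = [homepage]    # URLs to expand at the current depth
    for n in range(max_branches):
        next_frontier = []
        for link in frontier:
            for result in extractURLs(link):
                if result not in seen:
                    seen.add(result)
                    next_frontier.append(result)
        frontier = next_frontier  # descend exactly one level per pass

Because the loop only ever iterates over frontier and appends to next_frontier, no list is modified while being traversed, and each URL is expanded at most once.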

Jordan

Okay, I found a solution. All I did was change the links variable to a dictionary, with the value 0 representing a link that has not been searched and 1 representing a searched link. Then I iterated through a copy of the keys in order to preserve the branches and not let the loop wildly follow every link that gets added along the way. Finally, if a link is found that is not already in links, it is added and set to 0, to be searched later.

links = {homepage: 0}
for n in range(max_branches):
    for link in links.keys()[:]:
        if not links[link]:
            results = extractURLs(link)
            links[link] = 1  # mark this link as searched, per the 0/1 scheme above
            for result in results:
                if result not in links:
                    links[result] = 0
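
As a side note, searchURL in the question is still a stub. A minimal sketch (my own assumption, not part of this answer) of how a final keyword pass might consume the finished links dictionary, assuming the keyword = raw_input("> ") line from the question is uncommented and adding a keyword parameter to the stub's original signature:

    def searchURL(url, keyword):
        '''Fetch url and report whether keyword appears in its HTML.'''
        response = requests.get(url)
        return keyword.lower() in response.text.lower()

    for link in links:
        if searchURL(link, keyword):
            print "Match found:", link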