I am attempting to create a small script that takes a given website plus a keyword, follows all the links a certain number of times (only links on the website's own domain), and finally searches all the found pages for the keyword, returning any matches. The ultimate goal is that if you remember a website where you saw something, and you know a distinctive keyword that the page contained, this program might help you find the link to the lost page. Now my bug: while looping through all these pages, extracting their URLs, and building a list of them, it somehow ends up redundantly going over (and "removing") the same links again and again. I added a safeguard against this, but it doesn't seem to be working as expected. I suspect some URL(s) are mistakenly being duplicated into the list and end up being checked an infinite number of times.
Here's my full code (sorry about the length); the problem area seems to be at the very end, in the for loop:
import bs4, requests, sys

def getDomain(url):
    if "www" in url:
        domain = url[url.find('.')+1:url.rfind('.')]
    elif "http" in url:
        domain = url[url.find("//")+2:url.rfind('.')]
    else:
        domain = url[:url.rfind(".")]
    return domain

def findHref(html):
    '''Will find the link in a given BeautifulSoup match object.'''
    link_start = html.find('href="')+6
    link_end = html.find('"', link_start)
    return html[link_start:link_end]

def pageExists(url):
    '''Returns True if url (a URL string) returns a 200 response
    and doesn't redirect to a DNS search.'''
    response = requests.get(url)
    try:
        response.raise_for_status()
        if response.text.find("dnsrsearch") >= 0:
            print response.text.find("dnsrsearch")
            print "Website does not exist"
            return False
    except Exception as e:
        print "Bad response:", e
        return False
    return True

def extractURLs(url):
    '''Returns list of urls in url that belong to same domain.'''
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text)
    matches = soup.find_all('a')
    urls = []
    for index, link in enumerate(matches):
        match_url = findHref(str(link).lower())
        if "." in match_url:
            if not domain in match_url:
                print "Removing", match_url
            else:
                urls.append(match_url)
        else:
            urls.append(url + match_url)
    return urls

def searchURL(url):
    '''Search url for keyword.'''
    pass

print "Enter homepage: (no http://)"
homepage = "http://" + raw_input("> ")
homepage_response = requests.get(homepage)
if not pageExists(homepage):
    sys.exit()
domain = getDomain(homepage)
print "Enter keyword:"
#keyword = raw_input("> ")
print "Enter maximum branches:"
max_branches = int(raw_input("> "))

links = [homepage]
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
        for result in results:
            if result not in links:
                links.append(result)
Partial output (roughly 0.000000000001% of it):
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
You are repeatedly looping over the same links with your outer loop: each pass of `for n in range(max_branches)` restarts `for link in links` from the beginning, so every link crawled on an earlier pass is fetched and filtered all over again. I would also be careful about appending to a list you are iterating over, or you could well end up with an infinite loop.
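A minimal sketch of one way to restructure the loop, assuming a breadth-first crawl is what you want: keep a `visited` set of everything ever seen and build each level's frontier in a separate list, so nothing is ever appended to a list being iterated. The names `crawl_levels` and `get_links` are illustrative, not from your code; `get_links` stands in for your `extractURLs`.

```python
def crawl_levels(start, get_links, max_branches):
    '''Breadth-first crawl: each discovered URL is processed exactly once.

    get_links(url) should return the list of same-domain URLs found on
    that page (the role extractURLs plays in the question).
    '''
    visited = set([start])      # everything ever seen, for O(1) lookups
    frontier = [start]          # links to crawl at the current depth
    for _ in range(max_branches):
        next_frontier = []      # build the next level in a separate list,
        for link in frontier:   # never appending to the one being iterated
            for result in get_links(link):
                if result not in visited:
                    visited.add(result)
                    next_frontier.append(result)
        frontier = next_frontier  # descend one level
    return visited
```

Because `visited` is a set, the membership test stays cheap as the crawl grows, unlike `result not in links` on an ever-expanding list, and each URL is fetched at most once no matter how many pages link to it.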