I'm trying to scrape TV Tropes with BeautifulSoup, but for some reason the data I want is cut off. This happens even when I print the entire "soup" for the page. The specific example is this page: http://tvtropes.org/pmwiki/pmwiki.php/Series/Firefly
I want to scrape all the tropes in the folders at the bottom. For some reason, after "I was aimin'" in the A-D folder, under the Accidental Aiming Skills list item, it stops returning data from these folders. Then it prints out stuff in the . I'm doing everything right, so I don't understand what the problem is. Does TV Tropes not allow you to scrape the entire page for some reason?
import urllib2
from bs4 import BeautifulSoup  # assumed: BeautifulSoup 4 (bs4)

def webcrawler(startingurl):
    request = urllib2.Request(startingurl)
    url = urllib2.urlopen(request)
    soup = BeautifulSoup(url)
    # dump the whole parsed page
    print soup.prettify().encode('UTF-8')
    # this does the same thing
    for item in soup.findAll('a', {'class': 'twikilink'}):
        if 'Main' in str(item):
            print item, '\n'

webcrawler("http://tvtropes.org/pmwiki/pmwiki.php/Series/" + 'Firefly')
Try this,
and then edit your code to,
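A minimal sketch of what such an edit could look like, assuming the truncation comes from the default parser giving up on malformed HTML partway through the page (a common cause of a "cut off" soup, though not confirmed for this page). It asks BeautifulSoup for a more lenient parser, html5lib, which has to be installed separately (`pip install html5lib`), and falls back to whatever else is available. The helper names `parse_leniently` and `trope_links` are made up for illustration, and the code is written for Python 3 with bs4; on Python 2, `urllib2.urlopen` would replace `urllib.request.urlopen`.

```python
# Sketch only: assumes the bs4 package is installed, ideally with html5lib.
# parse_leniently and trope_links are hypothetical helper names.
from urllib.request import urlopen

from bs4 import BeautifulSoup, FeatureNotFound


def parse_leniently(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


def trope_links(soup):
    """Return the 'twikilink' anchors that mention 'Main', as in the question."""
    return [a for a in soup.find_all("a", class_="twikilink")
            if "Main" in str(a)]


def webcrawler(startingurl):
    """Fetch the page and print every matching trope link."""
    html = urlopen(startingurl).read()
    for link in trope_links(parse_leniently(html)):
        print(link)
```

Calling `webcrawler("http://tvtropes.org/pmwiki/pmwiki.php/Series/Firefly")` would then print the links. If the output still stops at the same place with html5lib, the parser is probably not the culprit and the raw response from `urlopen` is worth inspecting directly.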