BeautifulSoup / lxml: Are there problems with large elements?

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Output:

ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2

I also tried:

divs = soup.find_all(class_="block")

with the same result.

But there are 11 elements that match this condition. So is there some limitation, such as a maximum element size, and how can I get all the elements?

1 Answer

Anthon

The easiest way is probably to use 'html.parser' instead of 'lxml':

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")  # only change: the parser name
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Your original code (using lxml) printed 1 for me, but this prints 11. lxml recovers from bad markup, but it is not as lenient as html.parser on this page.
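
If you want to see how the available parsers cope with this page, a quick comparison like the sketch below can help (it assumes the optional html5lib package is installed; drop it from the tuple if it is not, and expect the counts to vary with the page's current markup):

import urllib2
from bs4 import BeautifulSoup

# Read the page once so every parser sees the same bytes; the file object
# returned by urlopen() would be exhausted after the first parse.
html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/").read()

# Count how many "block" divs each tree builder recovers from the markup.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print parser, len(soup.find_all("div", {"class": "block"}))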

Please note that the page produces over a thousand warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters like < and / in positions where they are not allowed.
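
As a small illustration of why the parser choice matters on markup like this, you can feed the tree builders a snippet containing a stray < and compare what they make of it (a sketch with made-up markup; how each parser recovers depends on its own error-handling rules):

from bs4 import BeautifulSoup

# Made-up snippet with a bare "<" inside text, similar to the errors tidy reports.
broken = '<div class="block">a < b</div><div class="block">c</div>'

# Print the repaired tree each parser builds from the same broken input.
for parser in ("lxml", "html.parser"):
    print parser
    print BeautifulSoup(broken, parser).prettify()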