BeautifulSoup / lxml: Are there problems with large elements?

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Output:

ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2

I also tried:

divs = soup.find_all(class_="block")

with the same result.

But there are 11 elements that match this condition. So is there some limitation, such as a maximum element size, and how can I get all the elements?

1 Answer

Anthon

The easiest way is probably to use 'html.parser' instead of 'lxml':

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")  # only change: the parser name
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Your original code (using lxml) printed 1 for me, but this prints 11. lxml recovers from bad markup, but it is not as lenient as html.parser on this page.
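
If you want to see how the available parsers cope with this page, a quick comparison like the sketch below can help (it assumes the optional html5lib package is installed; drop it from the tuple if it is not, and expect the counts to vary with the page's current markup):

import urllib2
from bs4 import BeautifulSoup

# Read the page once so every parser sees the same bytes; the file object
# returned by urlopen() would be exhausted after the first parse.
html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/").read()

# Count how many "block" divs each tree builder recovers from the markup.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print parser, len(soup.find_all("div", {"class": "block"}))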

Please note that the page produces over a thousand warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters like < and / in positions where they are not allowed.
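
As a small illustration of why the parser choice matters on markup like this, you can feed the tree builders a snippet containing a stray < and compare what they make of it (a sketch with made-up markup; how each parser recovers depends on its own error-handling rules):

from bs4 import BeautifulSoup

# Made-up snippet with a bare "<" inside text, similar to the errors tidy reports.
broken = '<div class="block">a < b</div><div class="block">c</div>'

# Print the repaired tree each parser builds from the same broken input.
for parser in ("lxml", "html.parser"):
    print parser
    print BeautifulSoup(broken, parser).prettify()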