Remove boilerplate content from HTML page


I would like to use the jusText implementation at https://github.com/miso-belica/jusText to extract the clean content from an HTML page. Basically it works like this:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)

I have already downloaded the pages I want to parse with this tool (some of them are no longer available online) and extracted their HTML content. Since jusText appears to work only on the output of a request (a response object), is there a way to set the content of a response object to the HTML text I want to parse?


There is 1 answer

James Mills (best answer)

response.content is just a string (<type 'str'> in Python 2; in Python 3 it is bytes):

>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>

jusText never sees the response object, only that string, so just call it with your own HTML:

justext.justext(my_html_string, justext.get_stoplist("English"))
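For pages already saved to disk, the same call can be wrapped in a small helper. This is only a minimal sketch: the function names load_saved_html and extract_clean_text are made up for illustration, and it assumes each page was saved as a single HTML file. The bytes are read unchanged, just as response.content would deliver them, and passed straight to jusText.

```python
from pathlib import Path


def load_saved_html(path):
    """Return the raw bytes of an HTML file that was saved to disk earlier."""
    # Reading bytes mirrors what response.content gives you from requests.
    return Path(path).read_bytes()


def extract_clean_text(html, language="English"):
    """Return the text of every paragraph jusText does not flag as boilerplate."""
    # Imported here so load_saved_html() can be used without jusText installed.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]
```

With a downloaded page at a path such as "saved_page.html" (a placeholder), extract_clean_text(load_saved_html("saved_page.html")) would return the cleaned paragraphs as a list of strings, with no requests call involved.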