Remove boilerplate content from HTML page

Question

2.1k views Asked by Crista23 At 13 June 2015 at 09:22

I would like to use the jusText implementation at https://github.com/miso-belica/jusText to extract the clean content from an HTML page. Basically it works like this:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)

I have already downloaded the pages I want to parse with this tool (some of them are no longer available online) and extracted their HTML content. Since jusText appears to work only on the output of a request (a response object), I am wondering whether there is a way to set the content of a response object to the HTML text I would like to parse.

Original Q&A

There is 1 answer

**James Mills** · Accepted Answer · 2015-06-13T09:29:04+00:00

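response.content is a plain string, not a response object. In Python 2 it is of <type 'str'> (in Python 3 it is bytes, which justext also accepts):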

>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>

So just pass your own HTML string directly:

justext.justext(my_html_string, justext.get_stoplist("English"))
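
For pages already saved to disk, the same idea might look like the sketch below. It assumes each page is stored as a local file; the helper names load_html and clean_paragraphs and the file name saved_page.html are hypothetical, not part of jusText. Reading the file in binary mode yields bytes, the same kind of value as response.content, so jusText can handle the decoding itself.

```python
def load_html(path):
    """Read a saved page as raw bytes, like response.content."""
    with open(path, "rb") as handle:
        return handle.read()


def clean_paragraphs(html, language="English"):
    """Return the text of every non-boilerplate paragraph in html."""
    # Third-party package; imported here so load_html works without it.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


# Usage (hypothetical file name):
#   for text in clean_paragraphs(load_html("saved_page.html")):
#       print(text)
```

No requests call or response object is involved; only the HTML itself is handed to justext.justext.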
