Extract image using Newspaper from HTML

1.7k views Asked by At

I can't download articles like one usually does to instantiate the Article object, like below:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.top_image

However, I can get the HTML from a request. Can I use this raw HTML and pass it somehow to Newspaper to extract the image from it? (below is an attempt, but doesn't work). Thanks

from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image
3

There are 3 answers

0
Life is complex On BEST ANSWER

The Python module Newspaper allows proxies to be used, but this feature is not listed within the module's documentation.


Proxies with Newspaper

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
           'http': "http://ip_address:port_number",
           'https': "https://ip_address:port_number"
          }

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg

Requests with Proxies and Newspaper

import requests
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.download(raw_html.content)
article.parse()
print(article.top_image) https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
4
char On

First of make sure you're using python3, that you have run pip3 install newspaper3k before.

Then if you're getting SSL errors with the first version (like below)

/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings warnings.warn(

you can disable them by adding

import urllib3
urllib3.disable_warnings()

This should work:

from newspaper import Article
import urllib3
urllib3.disable_warnings()


url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)

Run with python3 <yourfile>.py.


Setting the html in the Article yourself won't do you much good, as you'd not get anything in the other fields that way. Let me know if that fixes the issue, or if any other errors pop up!

0
Andrew Anderson On

I think here is what you want:

from newspaper import fulltext

html = 'your html'

text_from_html = fulltext(html)