As an absolute newbie to Python, I ran into a few difficulties using the newspaper library. My goal is to use newspaper on a regular basis to download all new articles from a German news website called "tagesschau" and all articles from CNN, to build a data set I can analyze in a few years. If I understand it correctly, I can use the following commands to download all articles into Python.
import newspaper
from newspaper import news_pool
tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')
papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2) # (2*2) = 4 threads total
news_pool.join()
If that's the right way to download all articles, how can I extract and save them outside of Python? Or how can I save those articles in Python so that I can reuse them after I restart Python?
Thanks for your help.
The following code saves the downloaded articles in HTML format. In the output folder, you'll find:
tagesschau_paper0.html, tagesschau_paper1.html, tagesschau_paper2.html, .....
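A minimal sketch of how files with these names could be written, assuming each downloaded article exposes its raw page in an `html` attribute (as newspaper's `Article` does after `news_pool.join()`). The helper names `save_articles_as_html` and `download_and_save`, and the `articles` output folder, are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def save_articles_as_html(articles, prefix, out_dir="articles"):
    """Write each article's raw HTML to <out_dir>/<prefix><n>.html.

    Articles without HTML (e.g. failed downloads) are skipped, so the
    file numbers stay consecutive.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for article in articles:
        html = getattr(article, "html", "")
        if not html:
            continue  # nothing was downloaded for this article
        target = out_path / f"{prefix}{len(saved)}.html"
        target.write_text(html, encoding="utf-8")
        saved.append(target)
    return saved


def download_and_save(url="http://tagesschau.de", prefix="tagesschau_paper"):
    """Download one source with news_pool, then save its articles."""
    # Imported here so the helper above can be used without newspaper.
    import newspaper
    from newspaper import news_pool

    paper = newspaper.build(url)
    news_pool.set([paper], threads_per_source=2)
    news_pool.join()  # downloads the HTML of every article in the source
    return save_articles_as_html(paper.articles, prefix)
```

Calling `download_and_save()` would then leave tagesschau_paper0.html, tagesschau_paper1.html, and so on in the `articles` folder, where they stay available after Python is restarted.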
Note: news_pool doesn't get anything from CNN, so I skipped writing code for it. If you check cnn_paper.size(), it returns 0. You have to import and use Source instead.
The code above can also serve as an example for saving articles in other formats, e.g. txt, and for saving only the parts you need from the articles, e.g. authors, body, publish_date.
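As a rough illustration of both points, the sketch below stores only selected fields as JSON and builds CNN through `Source`. It assumes parsed articles expose `url`, `title`, `authors`, `publish_date` and `text`, as newspaper's `Article` does after `parse()`. The function names and the file name `cnn_articles.json` are hypothetical, and whether `Source` actually finds CNN articles may depend on the library version and the site's current layout.

```python
import json
from pathlib import Path


def article_to_record(article):
    """Keep only the parts of an article that are worth archiving."""
    publish_date = getattr(article, "publish_date", None)
    if hasattr(publish_date, "isoformat"):
        publish_date = publish_date.isoformat()  # datetime -> JSON-friendly text
    return {
        "url": getattr(article, "url", ""),
        "title": getattr(article, "title", ""),
        "authors": list(getattr(article, "authors", [])),
        "publish_date": publish_date,
        "body": getattr(article, "text", ""),
    }


def save_records(articles, path):
    """Write the selected fields of all articles to one JSON file."""
    records = [article_to_record(a) for a in articles]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records


def load_records(path):
    """Read the records back, e.g. after restarting Python."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def build_cnn_records(path="cnn_articles.json"):
    """Assumed workflow: build CNN with Source, then download and parse."""
    from newspaper import Source  # imported here; requires newspaper

    cnn = Source("http://cnn.com")
    cnn.build()
    parsed = []
    for article in cnn.articles:
        try:
            article.download()
            article.parse()
        except Exception:
            continue  # skip articles that fail to download or parse
        parsed.append(article)
    return save_records(parsed, path)
```

JSON keeps the archive readable outside of Python as well; `load_records("cnn_articles.json")` returns the same list of dictionaries in a later session.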