Get more article URLs from a news source with newspaper3k?

Question


Asked by HelloGoodbye on 28 September 2020 at 01:59 · 945 views

When I do

import newspaper
paper = newspaper.build('http://cnn.com', memoize_articles=False)
print(len(paper.articles))

I see that newspaper found 902 articles from http://cnn.com, which seems quite few to me, considering that they publish many articles per day and have published articles online for many years. Are these really all the articles there are on http://cnn.com? If not, is there any way I can find the URLs of the rest of the articles too?


There is 1 answer

**Life is complex** · Answer 1 · 2020-10-02T21:34:36+00:00

Newspaper only queries the items on CNN's main page, so the module does not cover all the categories (e.g. business, health, etc.) on the domain. Based on my code, Newspaper discovers only 698 unique article URLs as of today. Some of these might still be the same article, because some of the URLs have hashes but appear to point to the same content.

P.S. You can query all the categories, but that requires Selenium coupled with Newspaper.

from newspaper import build

articles = []
urls_set = set()
cnn_articles = build('http://cnn.com', memoize_articles=False)
for article in cnn_articles.articles:
    # check to see if the article url is not within the urls_set
    if article.url not in urls_set:
        # add the unique article url to the set
        urls_set.add(article.url)
        articles.append(article.url)

print(len(articles))
# 698
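If the "hashes" mentioned above are URL fragments (the part after `#`), which is an assumption on my part, the near-duplicates could be collapsed by stripping the fragment before comparing. Here is a minimal sketch using only the standard library; `dedupe_urls` and the sample URLs are hypothetical names made up for illustration, not part of Newspaper.

```python
from urllib.parse import urldefrag


def dedupe_urls(urls):
    """Return URLs in first-seen order, treating URLs that differ
    only by their #fragment as the same article (an assumption)."""
    seen = set()
    unique = []
    for url in urls:
        # urldefrag splits "page#section" into ("page", "section")
        base, _fragment = urldefrag(url)
        if base not in seen:
            seen.add(base)
            unique.append(base)
    return unique


# Hypothetical sample data, not real CNN URLs
sample = [
    "https://example.com/story",
    "https://example.com/story#comments",
    "https://example.com/other",
]
print(dedupe_urls(sample))
```

With the answer's code, this could be called as `dedupe_urls(article.url for article in cnn_articles.articles)`. It would not catch duplicates that differ in other ways, such as query strings or tracking parameters.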
