Get more article URLs from a news source with newspaper3k?

When I do

import newspaper
paper = newspaper.build('http://cnn.com', memoize_articles=False)
print(len(paper.articles))

I see that newspaper found 902 articles from http://cnn.com, which seems like very few to me, considering that they publish many articles per day and have published articles online for many years. Are these really all the articles there are on http://cnn.com? If not, is there any way I can find the URLs of the rest of the articles too?

There is 1 answer

Life is complex On

Newspaper only queries the items on CNN's main page, so the module does not cover every category (e.g. business, health) on the domain. Based on my code, Newspaper discovers only 698 unique article URLs as of today. Some of these may still be duplicates, because some of the URLs contain hashes but appear to point to the same article.

P.S. You can query all the categories, but that requires Selenium coupled with Newspaper.

from newspaper import build

articles = []
urls_set = set()
cnn_articles = build('http://cnn.com', memoize_articles=False)
for article in cnn_articles.articles:
    # check whether the article URL has already been seen
    if article.url not in urls_set:
        # record the unique article URL
        urls_set.add(article.url)
        articles.append(article.url)

print(len(articles))
# 698
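
If the "hashes" mentioned above are URL fragments (the part after `#`), which is an assumption on my part, the near-duplicates could be collapsed by stripping the fragment before comparing. A minimal sketch using only the standard library; the helper name `unique_urls` and the sample URLs are hypothetical and stand in for a live `build()` call:

```python
from urllib.parse import urldefrag


def unique_urls(urls):
    """Return URLs in first-seen order, treating URLs that differ
    only by their '#fragment' as the same URL."""
    seen = set()
    result = []
    for url in urls:
        # urldefrag splits off the '#...' part; keep only the base URL
        base = urldefrag(url).url
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result


# Hypothetical sample URLs, used here instead of real newspaper output
sample = [
    "http://cnn.com/2020/01/01/world/story/index.html",
    "http://cnn.com/2020/01/01/world/story/index.html#comments",
    "http://cnn.com/2020/01/02/health/other/index.html",
]
print(len(unique_urls(sample)))
```

With Newspaper, the input would be something like `[a.url for a in cnn_articles.articles]`. This only helps if the duplicates really differ by fragment; if the hash is part of the path or query string, a different normalization is needed.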