I have installed the Newspaper3k library on my Mac with sudo pip3 install newspaper3k. I'm using Python 3.
I want to return the data supported by the Article object (url, date, title, text, summary and keywords), but I do not get any data:
import newspaper
from newspaper import Article
#creating website for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)
#I have tried for https://www.euronews.com/, https://edition.cnn.com/, https://www.bbc.com/
for article in cnn_paper.articles:
    article_url = article.url  # works
    news_article = Article(article_url)  # works
    print("OBJECT:", news_article, '\n')  # works
    print("URL:", article_url, '\n')  # works
    print("DATE:", news_article.publish_date, '\n')  # does not work
    print("TITLE:", news_article.title, '\n')  # does not work
    print("TEXT:", news_article.text, '\n')  # does not work
    print("SUMMARY:", news_article.summary, '\n')  # does not work
    print("KEYWORDS:", news_article.keywords, '\n')  # does not work
    print()
    input()
I get the Article object and the URL, but everything else is ''. I have tried different websites, but the result is the same.
Then I tried to add:
news_article.download()
news_article.parse()
news_article.nlp()
I have also tried setting a Config with custom headers and timeouts, but the results are the same.
When I do that, for each website I get only 16 articles with date, title, and body values. That is very strange to me: I get the same number of populated articles for every website, and for more than 95% of the news articles I get None.
Can Beautiful Soup help me?
Can someone help me understand what the problem is, why I'm getting so many Null/NaN/"" values, and how I can fix that?
These are the docs for the library:
I would recommend that you review the newspaper overview document that I published on GitHub. The document has multiple extraction examples and other techniques that might be useful.
Concerning your question...
Newspaper3K will parse certain websites nearly flawlessly. But there are plenty of websites that will require reviewing a page's navigational structure to determine how to parse the article elements correctly.
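Even on sites that parse well, an Article's fields stay empty until the article has been downloaded and parsed, and summary and keywords are only filled in after nlp(). A minimal sketch of that sequence follows. The helper name fetch_fields and the article_cls parameter are hypothetical: the class is passed in, instead of being imported, only so the function can be tried without network access.

```python
def fetch_fields(url, article_cls):
    """Download and parse one article and return its fields as a dict.

    `article_cls` is assumed to behave like newspaper.Article
    (download(), parse(), nlp() and the attributes read below).
    """
    article = article_cls(url)
    try:
        article.download()  # fetch the HTML
        article.parse()     # fills title, text, publish_date, ...
    except Exception as exc:  # e.g. newspaper's ArticleException
        return {"url": url, "error": str(exc)}

    fields = {
        "url": url,
        "date": article.publish_date,  # may still be None on some sites
        "title": article.title,
        "text": article.text,
    }
    try:
        article.nlp()  # fills summary and keywords; needs NLTK data
        fields["summary"] = article.summary
        fields["keywords"] = article.keywords
    except Exception as exc:
        fields["nlp_error"] = str(exc)
    return fields
```

With newspaper installed, this could be called inside the question's loop as fetch_fields(article.url, Article). A date of None after a successful parse usually means the site stores the date somewhere the default extractor does not look.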
For instance, https://www.marketwatch.com has individual article elements, such as the title, publish date and other items, stored within the meta tag section of the page.
The newspaper example below will parse the elements correctly. Note that you might need to do some data cleaning of the keyword or tag output.
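The data cleaning mentioned above could look roughly like the sketch below. It assumes the keywords arrive either as a comma-separated string (as is common in a keywords meta tag) or as a list of strings; the function name clean_keywords is hypothetical and not part of newspaper.

```python
def clean_keywords(raw):
    """Return a de-duplicated list of lowercase keywords.

    Accepts a comma-separated string or a list of strings;
    None or an empty value yields an empty list.
    """
    if isinstance(raw, str):
        raw = raw.split(",")
    cleaned = []
    seen = set()
    for item in raw or []:
        # Collapse runs of whitespace, trim the ends and lowercase.
        keyword = " ".join(str(item).split()).lower()
        if keyword and keyword not in seen:
            seen.add(keyword)
            cleaned.append(keyword)
    return cleaned


print(clean_keywords(" Markets, stocks ,, Markets,U.S.  Economy "))
```

Order is preserved so the first keywords listed by the site stay first.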
https://www.euronews.com is similar to https://www.marketwatch.com, except that some of the article elements are located in the main body and other items are within the meta tag section.
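To illustrate the idea of combining both locations, here is a standard-library sketch that reads meta tags and falls back to the first h1 in the body when a meta value is missing. The sample HTML and the meta keys (og:title, article:published_time, keywords) are assumptions for illustration, not the actual markup of either site; inspect the page source to find the real ones. In newspaper itself, the parsed meta tags are available on a parsed article as Article.meta_data.

```python
from html.parser import HTMLParser


class ArticleFieldParser(HTMLParser):
    """Collect <meta> tag values and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.h1 = ""
        self._in_h1 = False
        self._h1_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first value seen for each key.
                self.meta.setdefault(key, attrs["content"])
        elif tag == "h1" and not self._h1_done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._h1_done = True
            self.h1 = self.h1.strip()

    def handle_data(self, data):
        if self._in_h1:
            self.h1 += data


def extract_fields(html):
    """Prefer meta tag values; fall back to the body for the title."""
    parser = ArticleFieldParser()
    parser.feed(html)
    return {
        "title": parser.meta.get("og:title") or parser.h1,
        "date": parser.meta.get("article:published_time"),
        "keywords": parser.meta.get("keywords"),
    }


# Hypothetical page: date and keywords in meta tags, title only in the body.
SAMPLE_HTML = """
<html><head>
<meta name="keywords" content="europe, politics">
<meta property="article:published_time" content="2021-01-15T10:00:00+00:00">
</head><body><h1> Sample headline </h1><p>Body text.</p></body></html>
"""

print(extract_fields(SAMPLE_HTML))
```

The same fallback pattern applies to any field: try the location that is most reliable for the site first, then the next one, and only then accept None.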