Newspaper3k scrape several websites

I want to get articles from several websites. I tried the code below, but I don't know what to do next.

import newspaper
from newspaper import news_pool

lm_paper = newspaper.build('https://www.lemonde.fr/')
parisien_paper = newspaper.build('https://www.leparisien.fr/')

papers = [lm_paper, parisien_paper]
news_pool.set(papers, threads_per_source=2)  # 2 sources * 2 threads each = 4 threads total
news_pool.join()

There is 1 answer

Life is complex (best answer)

Below is one way to use Newspaper's news_pool. Note that news_pool is slow: it takes minutes before the first titles are printed. I believe this lag comes from the articles being downloaded in the background. I'm not sure how to speed this up using Newspaper.

import newspaper
from newspaper import Config
from newspaper import news_pool

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

lm_paper = newspaper.build('https://www.lemonde.fr/', config=config, memoize_articles=False)
parisien_paper = newspaper.build('https://www.leparisien.fr/', config=config, memoize_articles=False)
french_papers = [lm_paper, parisien_paper]

# this setting is adjustable 
news_pool.config.number_threads = 2

# this setting is adjustable 
news_pool.config.thread_timeout_seconds = 1

news_pool.set(french_papers)
news_pool.join()

for source in french_papers:
    for article_extract in source.articles:
        if article_extract:
            article_extract.parse()
            print(article_extract.title)
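
The final loop above parses every article and stops at the first one whose parse() call raises. As a minimal sketch, the loop could be wrapped in a helper that caps the number of articles per source and skips failures. The name collect_titles and the max_per_source parameter are hypothetical, not part of Newspaper, and the broad except rests on the assumption that an article whose download failed raises when parse() is called.

```python
from itertools import islice


def collect_titles(sources, max_per_source=5):
    """Parse up to max_per_source articles per source and return their titles.

    `sources` is any iterable of objects exposing an `.articles` list, such as
    the objects in french_papers after news_pool.join() has finished.
    """
    titles = []
    for source in sources:
        # islice caps how many articles are parsed for each source
        for article in islice(source.articles, max_per_source):
            try:
                article.parse()
            except Exception as exc:
                # Assumption: a failed download surfaces as an exception here,
                # so the article is reported and skipped instead of aborting.
                print(f"skipped {getattr(article, 'url', '?')}: {exc}")
                continue
            if article.title:
                titles.append(article.title)
    return titles
```

It would be called after news_pool.join(), for example with collect_titles(french_papers, max_per_source=10), printing each returned title. This only limits parsing; the downloads performed by news_pool still cover every article, so it is unlikely to remove the initial wait.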