Fetching thousands of URLs with Newspaper3k and multiprocessing slows down after a few hundred calls


I have code that is meant to:

a) call an API to get Google SERP results;
b) open each retrieved URL with the newspaper3k Python 3 library, which extracts the text of the news article;
c) save the text of the article to a .txt file.

The implementation of the multiprocessing part is as follows:

from functools import partial
from multiprocessing.pool import ThreadPool

def createFile(newspaper_article):
    """ function that opens each article, parses it, and saves it to a file on disk """

def main():
    p = ThreadPool(10)
    # partial() binds no extra arguments here, so
    # p.map(createFile, sourcesList) would be equivalent
    p.map(partial(createFile), sourcesList)
    p.close()
    p.join()

if __name__ == '__main__':
    main()
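
The body of createFile is omitted above; a minimal sketch of what it might look like, assuming newspaper3k's Article and Config API (the timeout value and the filename scheme are placeholders):

from newspaper import Article, Config

config = Config()
config.request_timeout = 10  # fail fast instead of hanging on a slow host

def createFile(newspaper_article):
    """ opens one article URL, parses it, and saves its text to disk """
    article = Article(newspaper_article, config=config)
    article.download()
    article.parse()
    # placeholder filename scheme; adapt to your own naming convention
    filename = str(abs(hash(newspaper_article))) + '.txt'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(article.text)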

I have also tried with Pool instead of ThreadPool.

The problem is that after fetching and saving a few hundred articles, it slows down dramatically. Occasionally a link takes a while to load, but I'd expect the other workers to keep going in the meantime. What am I doing wrong?
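
One way to check whether a single slow link is really blocking the rest is to time each call and consume results with imap_unordered, which yields results as workers finish rather than in submission order. A sketch under those assumptions (timed_fetch is a hypothetical wrapper around the createFile above):

import time
from multiprocessing.pool import ThreadPool

def timed_fetch(url):
    """ wraps createFile so each URL reports how long it took """
    start = time.monotonic()
    createFile(url)
    return url, time.monotonic() - start

def main():
    with ThreadPool(10) as p:
        # one slow URL delays only its own worker; the other nine keep going
        for url, elapsed in p.imap_unordered(timed_fetch, sourcesList):
            print('%6.1fs  %s' % (elapsed, url))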
