Newspaper3k: how to retrieve cached articles?


The documentation says that, by default, newspaper caches all previously extracted articles and eliminates any article it has already extracted.

>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
1030

>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
2

Okay, but it says nothing about how to get those articles back. Once I have built a site, how can I retrieve the cached articles?


There is 1 answer

Akash Sahu On

newspaper3k memoizes the articles it has already seen for each source, which is why the second build returns so few.

Setting memoize_articles to False turns the caching mechanism off:

cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)

However, if you want to keep the caching and still access the cached articles, look for the .newspaper_scraper directory inside your temp folder. On a Windows machine the path is:

C:\Users\your_user\AppData\Local\Temp\.newspaper_scraper\memoized

For Linux-based OSes, try looking in

/tmp/.newspaper_scraper/memoized/

For macOS, look in the directory specified by $TMPDIR. This may be any or none of the following:

/tmp/.newspaper_scraper/memoized/
/private/tmp/.newspaper_scraper/memoized/
~/Library/Caches/TemporaryItems/.newspaper_scraper/memoized/
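
Rather than guessing the path per OS, you can ask Python where the temp folder is and look there. The sketch below uses only the standard library. The helper names memo_dir and read_memoized_urls are mine, not part of newspaper3k, and it assumes each file in the memoized folder is a plain text file with one entry per line. As far as I can tell those entries are article URLs, not parsed article content, so to get the text back you would still have to download and parse each URL again.

```python
import os
import tempfile


def memo_dir(tmp_root=None):
    """Return the expected memoized-cache folder under the temp directory.

    tmp_root is a hypothetical override for testing; by default the
    platform temp directory is used, which matches the paths listed above.
    """
    root = tmp_root if tmp_root is not None else tempfile.gettempdir()
    return os.path.join(root, ".newspaper_scraper", "memoized")


def read_memoized_urls(directory):
    """Map each cache file name to the non-empty lines it contains.

    Assumption: every file is plain text with one entry per line.
    Returns an empty dict if the folder does not exist.
    """
    entries = {}
    if not os.path.isdir(directory):
        return entries
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            entries[name] = [line.strip() for line in fh if line.strip()]
    return entries


if __name__ == "__main__":
    folder = memo_dir()
    print("Looking in:", folder)
    for filename, lines in read_memoized_urls(folder).items():
        print(filename, len(lines), "entries")
```

If the folder is missing, nothing has been memoized yet on this machine (or the temp folder was cleaned), and the function just returns an empty dict.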