I want to scrape data from a French website with newspaper3k, but the result is only 50 articles. This website has many more than 50 articles. Where am I going wrong?
My goal is to scrape all the articles on this website.
I tried this:
import newspaper
legorafi_paper = newspaper.build('http://www.legorafi.fr/', memoize_articles=False)
# Empty list to put all urls
papers = []
for article in legorafi_paper.articles:
    papers.append(article.url)
print(legorafi_paper.size())
The result of this print is 50 articles. I don't understand why newspaper3k only scrapes 50 articles and not many more.
UPDATE OF WHAT I TRIED:
def Foo(firstTime=[]):
    # Switch into the cookie-consent iframe the first time only
    # (the mutable default argument acts as a "already done" flag)
    if firstTime == []:
        WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')
%%time

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['societe', 'politique']
papers = []
urls_set = set()  # unique article URLs

driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')

for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    driver.get(url)
    Foo()  # accept the cookie consent dialog (only needed once)
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button--filled>span.baseText"))).click()

    pagesToGet = 2
    title = []
    content = []
    for page in range(1, pagesToGet + 1):
        print('Processing page :', page)
        print(driver.current_url)
        time.sleep(3)

        # Collect article links from the category listing
        raw_html = requests.get(url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])

        # Download and parse each collected article
        for paper_url in papers:
            article = Article(paper_url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            time.sleep(5)

        # Move to the next subpage of the category
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)
UPDATE 09-21-2020
I rechecked your code, and it is working correctly: it extracts all the articles on the main page of Le Gorafi. The articles on that page are highlights from the category pages, such as societe, sports, etc.
Each of the articles in the main page's source code is also listed on a category page, such as sports.
It seems that there are 35 unique article entries on the main page.
If I change the URL in the code above to http://www.legorafi.fr/category/sports, it returns the same number of articles as http://www.legorafi.fr. After looking at the source code for Newspaper on GitHub, it seems that the module uses urlparse and keys off the netloc segment of the URL, which is www.legorafi.fr for both pages. I noted that this is a known problem with Newspaper, based on an open issue in its repository.
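A quick way to see this (my own minimal check, not Newspaper's actual code) is to compare the netloc of the two URLs:

from urllib.parse import urlparse

# Both URLs share the same netloc, so Newspaper treats them as a single source
print(urlparse('http://www.legorafi.fr/').netloc)                 # www.legorafi.fr
print(urlparse('http://www.legorafi.fr/category/sports').netloc)  # www.legorafi.fr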
Obtaining all the articles is more complex, because you have to use some additional modules, namely requests and BeautifulSoup (the latter can be imported from Newspaper). The code below can be refined to obtain all the articles listed in the source code of the main page and the category pages using requests and BeautifulSoup.
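Here is a minimal sketch of that approach. It assumes the listing pages keep the div class 'articles' markup used in your code, and the category slugs are only illustrative:

import requests
from newspaper.utils import BeautifulSoup

base_url = 'http://www.legorafi.fr'
categories = ['societe', 'politique', 'people', 'sports']  # illustrative slugs

# Collect unique article URLs from the main page and the category landing pages
article_urls = set()
for page_url in [base_url] + [base_url + '/category/' + c for c in categories]:
    raw_html = requests.get(page_url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tag in soup.findAll('div', {'class': 'articles'}):
        for link in articles_tag.find_all('a', href=True):
            if not link['href'].endswith('#commentaires'):
                article_urls.add(link['href'])

print(len(article_urls))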
If you need to obtain the articles listed in the subpages of a category page (politique currently has 120 subpages), then you would have to use something like Selenium to click through the pagination links.
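As a rough sketch of that idea, something along these lines could click through a category's "Suivant" links and collect the article URLs from each subpage. The selectors are the ones from your code and may need adjusting, and you may also have to accept the cookie consent first, as your Foo function does:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from newspaper.utils import BeautifulSoup

driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/category/politique')

article_urls = set()
while True:
    # Parse the currently loaded subpage and collect its article links
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for articles_tag in soup.findAll('div', {'class': 'articles'}):
        for link in articles_tag.find_all('a', href=True):
            if not link['href'].endswith('#commentaires'):
                article_urls.add(link['href'])
    try:
        # Follow the pagination link to the next subpage
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
    except NoSuchElementException:
        break  # no "Suivant" link left: last subpage reached
    time.sleep(2)

print(len(article_urls))
driver.quit()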
Hopefully, this code helps you get closer to achieving your objective.