How to get around Newspaper throwing 503 exceptions for certain webpages


I'm trying to scrape a number of webpages using newspaper3k, and my program is throwing 503 exceptions. Can anyone help me identify why this is happening and how to get around it? To be exact, I'm not looking to catch these exceptions, but to understand why they occur and prevent them if possible.

from newspaper import Article

dates = list()
titles = list()

urls = ['https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-02',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-mps-hearing-may-21',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-05-06',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-fsr-hearing-may-21',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-03-04',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-2019-20-reserve-bank-annual-review',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-12-02',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-28',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-22',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-19',
 'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-09-14']

for url in urls:
    speech = Article(url)
    speech.download()
    speech.parse()
    dates.append(speech.publish_date)
    titles.append(speech.title)

Here's my Traceback:

---------------------------------------------------------------------------
ArticleException                          Traceback (most recent call last)
<ipython-input-5-217a6cafe26a> in <module>
     20     speech = Article(url)
     21     speech.download()
---> 22     speech.parse()
     23     dates.append(speech.publish_date)
     24     titles.append(speech.title)

/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in parse(self)
    189 
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192 
    193         self.doc = self.config.get_parser().fromstring(self.html)

/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in throw_if_not_downloaded_verbose(self)
    529             raise ArticleException('You must `download()` an article first!')
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531             raise ArticleException('Article `download()` failed with %s on URL %s' %
    532                   (self.download_exception_msg, self.url))
    533 

ArticleException: Article `download()` failed with 503 Server Error: Service Temporarily Unavailable 
for url: https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29 
on URL https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29

1 answer

Answered by Life is complex (accepted answer):

Here is how you can troubleshoot the 503 Server Error: Service Temporarily Unavailable error using the Python requests package.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}

base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.status_code)
# output 
503 

Why are we getting a 503 Server Error?

Let's look at the content being returned by the server.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}

base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.text)
# output

truncated...

<title>Website unavailable - Reserve Bank of New Zealand - Te Pūtea Matua</title>

truncated...

<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>

truncated...

<form class="challenge-form" id="challenge-form" action="/research-and-publications/speeches/2021/speech2021-06-29?__cf_chl_jschl_tk__=73ad3f68fb15cc9284b25b7802626dd4ebe102cd-1625840173-0-ATQAZ5g7wCwLU2Q7agCqc1p59qs6ghpsYPVhDNwDN5r7vefk0P1UbjR4AJOUl0kUCZmDi-EVWX8XekL6VkqOgKTd1zqd5QWWlT3f2Dp_aUWQgCAH3bnS4x0wyc8-xGOLm-tcMKCXcTXH-OpiGoUX8u__bk1TIZ0gI_TYMB-oy0nJi7dMYLgJnvJhwhTllDoYUbCzmo2h2idIJPqIjNaAwupvbdpvHnrogPDnFhCe8Cco9-eKlq4w0G563f_OJ3M7YQChBjCoHYlT8baMoOLzP-Kb33rNmlG0uXhzoiIBROsPw9pavOrO1vsbqf31ZArDRuy0y7rsfrhAD7iU113zmypN81tgqgL_F8YTzygRvI_z3Cs2YOMxjB53-jq1pWwqsW_ItTaY7I3vh5lg_12EUzEddcwmuIj1wI2NbnA7EU06QNHYYn_Ye4TKM0gu9k4031hGybszE3nRKCdTXgMSgJbYhTJ6bJYPSb_2IHMUHlYyHksxePJ4C_5-5X8qIdJApSTFBfCLLLAZLrkFnBk7ep4" method="POST" enctype="application/x-www-form-urlencoded">
 
truncated...

var a = document.getElementById('cf-content');

truncated...

<p>Your access to the Reserve Bank website has been restricted. If you think you should be able to access our website please email <a href="mailto:[email protected]">[email protected]</a>.

If we look at the returned text, we can see that the website is asking your browser to complete a challenge-form. Additional markers in the text (e.g. cf-content) show that the website is protected by Cloudflare.
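Based on the markers visible in the response above, you can detect this kind of challenge page in your own code before attempting to parse it. This is a minimal sketch: the marker list below is an assumption drawn from this particular response, not an official Cloudflare interface.

```python
# Markers observed in the Cloudflare interstitial page returned above.
CLOUDFLARE_MARKERS = ('challenge-form', 'cf-content', '__cf_chl_jschl_tk__')

def looks_like_cloudflare_challenge(html: str) -> bool:
    """Return True if the HTML appears to be a Cloudflare challenge page
    rather than the real article content."""
    return any(marker in html for marker in CLOUDFLARE_MARKERS)

# A fragment of the response shown above triggers the check;
# an ordinary page does not.
print(looks_like_cloudflare_challenge('<form class="challenge-form" id="challenge-form">'))  # True
print(looks_like_cloudflare_challenge('<html><body>A speech</body></html>'))  # False
```

A check like this lets a scraper skip or retry a URL instead of handing the interstitial HTML to newspaper3k's parser.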

Bypassing this protection is extremely difficult. Here is one of my recent answers on the complexity of bypassing this protection.

Can't scrape product title from a webpage
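If you still want to attempt the download, one option sometimes tried against this kind of protection is the third-party cloudscraper package, which tries to solve Cloudflare's JavaScript challenge automatically. This is a hedged sketch, not a guaranteed fix: whether it succeeds depends on the site's protection level, and the `fetch` helper name below is my own, not part of either library.

```python
import requests

def fetch(url: str, timeout: int = 30) -> requests.Response:
    """Fetch a URL, using cloudscraper's challenge-solving session when the
    package is installed, falling back to a plain requests session otherwise."""
    try:
        import cloudscraper  # third-party: pip install cloudscraper
        session = cloudscraper.create_scraper()
    except ImportError:
        session = requests.Session()
    return session.get(url, timeout=timeout)

# Usage (network access required):
# resp = fetch('https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29')
# print(resp.status_code)
```

Even when this works, a fetched page would still need to be handed to newspaper3k via `Article(url).set_html(resp.text)` rather than `download()`, and the site may block automated access again at any time.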