Scrapy encountered HTTP status <521>


I am new to Scrapy and tried to crawl a website page, but the request came back with HTTP status code 521.

Does this mean the server refused the connection? (I can open the page in a browser.) I tried setting a cookie, but the request still returned 521.

Questions:

  1. What is the reason I got the 521 status code?

  2. Is it because of the cookie setting? Is something wrong with the cookie handling in my code? (See the settings sketch after this list.)

  3. How can I crawl this page?
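
One way to check the cookie handling is Scrapy's built-in cookie debugging. A minimal settings.py sketch, assuming the stock CookiesMiddleware that appears in the log below:

    # settings.py
    # With COOKIES_DEBUG on, Scrapy logs every Cookie header it sends and
    # every Set-Cookie header it receives, which shows whether the cookie
    # passed to Request(cookies=...) is actually transmitted.
    COOKIES_ENABLED = True
    COOKIES_DEBUG = True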

Thank you very much for your help!

The log:

2015-06-07 08:27:26+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ccdi)
2015-06-07 08:27:26+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-07 08:27:26+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ccdi.spiders', 'FEED_URI': '412.json', 'SPIDER_MODULES': ['ccdi.spiders'], 'BOT_NAME': 'ccdi', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3)AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 2}
2015-06-07 08:27:26+0800 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled item pipelines:
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider opened
2015-06-07 08:27:27+0800 [ccdi] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Crawled (521) <GET http://www.ccdi.gov.cn/jlsc/index_2.html> (referer: None)
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Ignoring response <521 http://www.ccdi.gov.cn/jlsc/index_2.html>: HTTP status code is not handled or not allowed
2015-06-07 08:27:27+0800 [ccdi] INFO: Closing spider (finished)
2015-06-07 08:27:27+0800 [ccdi] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 537,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 512,
     'downloader/response_count': 1,
     'downloader/response_status_count/521': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 468000),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 359000)}   
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider closed (finished)

My original code:

#encoding: utf-8

import sys
import scrapy
import re
from scrapy.selector import Selector
from scrapy.http.request import Request
from ccdi.items import CcdiItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class CcdiSpider(CrawlSpider):
    name = "ccdi"
    allowed_domains = ["ccdi.gov.cn"]
    # start_urls should be a list of URLs, not a bare string.
    start_urls = ["http://www.ccdi.gov.cn/jlsc/index_2.html"]
    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r"/jlsc/+", ),
    #         callback="parse_ccdi", follow=True),
    #)

    def start_requests(self):
        # Issue the first request with a cookie attached.
        yield Request(self.start_urls[0], cookies={'NAME': 'Value'},
                      callback=self.parse_ccdi)

    def parse_ccdi(self, response):
        # Collect all fields into a single item.
        item = CcdiItem()
        self.get_title(response, item)
        self.get_url(response, item)
        self.get_time(response, item)
        self.get_keyword(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['ccdi_title'] = title

    def get_text(self, response, item):
        ccdi_body = response.xpath(
            "//div[@class='TRS_Editor']/div[@class='TRS_Editor']/p/text()").extract()
        if ccdi_body:
            item['ccdi_body'] = ccdi_body

    def get_time(self, response, item):
        ccdi_time = response.xpath("//em[@class='e e2']/text()").extract()
        if ccdi_time:
            # Drop the first five characters (a label preceding the date).
            item['ccdi_time'] = ccdi_time[0][5:]

    def get_url(self, response, item):
        ccdi_url = response.url
        if ccdi_url:
            print ccdi_url
            item['ccdi_url'] = ccdi_url

    def get_keyword(self, response, item):
        ccdi_keyword = response.xpath(
            "//html/head/meta[@http-equiv = 'keywords']/@content").extract()
        if ccdi_keyword:
            item['ccdi_keyword'] = ccdi_keyword

1 Answer

Answered by Steffen Schmitz

The HTTP status code 521 is a custom error code sent by Cloudflare and usually means that the web server is down: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#521error
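
A 521 response is still a response, and its body often says what went wrong. By default Scrapy's HttpErrorMiddleware discards it, which is the "HTTP status code is not handled or not allowed" line in the log above. If you want to inspect the body, you can whitelist the status on the spider. A minimal sketch (the spider name here is made up for illustration):

    import scrapy

    class Inspect521Spider(scrapy.Spider):
        name = "inspect521"  # hypothetical name, illustration only
        # Pass 521 responses through to the callback instead of
        # letting HttpErrorMiddleware discard them.
        handle_httpstatus_list = [521]
        start_urls = ["http://www.ccdi.gov.cn/jlsc/index_2.html"]

        def parse(self, response):
            self.log("status: %d" % response.status)
            # The first few hundred bytes usually reveal whether this is a
            # plain error page or an anti-bot JavaScript challenge.
            self.log("body starts: %r" % response.body[:200])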

In my case, the error no longer occurred after setting a custom USER_AGENT in settings.py.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'crawler (+http://example.com)'
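
If you would rather not change the project-wide setting, the same header can be sent on an individual request. A sketch, reusing the question's URL and a browser-style UA string as an example value:

    from scrapy import Request, Spider

    class UASpider(Spider):
        name = "ua_example"  # hypothetical name, illustration only

        def start_requests(self):
            # Any realistic browser User-Agent string can be substituted here.
            ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) "
                  "AppleWebKit/536.5 (KHTML, like Gecko) "
                  "Chrome/19.0.1084.54 Safari/536.5")
            yield Request("http://www.ccdi.gov.cn/jlsc/index_2.html",
                          headers={"User-Agent": ua},
                          callback=self.parse)

        def parse(self, response):
            self.log("fetched %s with status %d" % (response.url, response.status))

A User-Agent set this way takes precedence over the USER_AGENT setting, because the stock UserAgentMiddleware only fills in the header when the request has not already set one.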