I've written a script in Scrapy that sends a request through a custom middleware so that the request is proxied. However, the middleware doesn't seem to have any effect. When I print response.meta, I get {'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.9680554866790771}, which clearly indicates that my request is not passing through the custom middleware. I'm using CrawlerProcess to run the script.

The spider contains:

import scrapy
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        print(response.meta)
        print(response.text)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()

The middleware contains:

class ProxiesMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://206.189.25.70:3128'
        return request

Changes I've made in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'proxyspider.middleware.ProxiesMiddleware': 100,
}

The following image shows the project hierarchy (image omitted).

What change should I make so that the request is proxied through the middleware?

2 Answers

Answer from Georgiy (accepted):

You need to check the log output of this line: [scrapy.middleware] INFO: Enabled downloader middlewares: for the list of active downloader middlewares. Your middleware should appear in that list if it is active.

As far as I remember, usage of the scrapy.contrib modules is deprecated now. See: Scrapy: No module named 'scrapy.contrib'

Your code with the custom middleware is nearly ready for use with the scrapy command-line tool:
scrapy crawl proxiedscript.

However, if you launch the Scrapy application as a script, your crawler process needs to read the project settings first (e.g. via scrapy.utils.project.get_project_settings()),
or you can define the DOWNLOADER_MIDDLEWARES setting as an argument to CrawlerProcess:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DOWNLOADER_MIDDLEWARES': {
        # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,  # deprecated path as of Scrapy 1.6
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,  # enabled by default
        'proxyspider.middleware.ProxiesMiddleware': 100,
    },
})
Answer from jspcal:

Perhaps return None instead of the Request? Returning a Request from process_request makes Scrapy reschedule it and prevents any other downloader middlewares from running.

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request
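A sketch of the middleware with that fix applied (same proxy address as in the question; whether that proxy is still alive is another matter):

```python
class ProxiesMiddleware(object):
    """Downloader middleware that routes every request through a proxy."""

    def process_request(self, request, spider):
        # Tag the request so HttpProxyMiddleware (which runs later in
        # the chain, at a higher priority number) applies the proxy.
        request.meta['proxy'] = 'http://206.189.25.70:3128'
        # Returning None lets processing continue through the remaining
        # downloader middlewares and on to the downloader. Returning the
        # request, as in the question, re-schedules it and skips the rest
        # of the chain.
        return None
```

No scrapy imports are needed here: a downloader middleware is a plain class, and Scrapy discovers it through the DOWNLOADER_MIDDLEWARES setting.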