I am using Scrapy 1.0.5 and Gearman to create distributed spiders. The idea is to build a spider, call it from a Gearman worker script, and pass it 20 URLs at a time to crawl, sent from a Gearman client to the worker and on to the spider.
I am able to start the worker and pass URLs from the client, through the worker, to the spider to crawl. The first URL or array of URLs gets picked up and crawled. Once the spider is done, I am unable to reuse it: I get the log message that the spider is closed, and when I run the client again, the spider reopens but doesn't crawl.
Here is my worker:
import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
gm_worker = gearman.GearmanWorker(['localhost:4730'])
def task_listener_reverse(gearman_worker, gearman_job):
    process = CrawlerProcess(get_project_settings())
    data = json.loads(gearman_job.data)
    if data['vendor_name'] == 'walmart':
        process.crawl('walmart', url=data['url_list'])
    process.start()  # the script will block here until the crawling is finished
    return 'completed'
# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)
# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()
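For reference, my client submits work to this worker along the lines of the sketch below (a minimal sketch of the python-gearman client call; the 'reverse' task name and the vendor_name/url_list payload fields match what the worker reads, while the URLs here are just placeholders):
import gearman
import json

gm_client = gearman.GearmanClient(['localhost:4730'])

# Placeholder batch of URLs for illustration
payload = {
    'vendor_name': 'walmart',
    'url_list': ['http://www.walmart.com/ip/placeholder-1',
                 'http://www.walmart.com/ip/placeholder-2'],
}

# submit_job blocks until the worker returns 'completed'
completed_request = gm_client.submit_job('reverse', json.dumps(payload))
print(completed_request.result)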
Here is the code of my Spider.
from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider
class WalmartSpider(Spider):
    name = "walmart"

    def __init__(self, **kw):
        super(WalmartSpider, self).__init__(**kw)
        self.start_urls = kw.get('url')
        self.allowed_domains = ["walmart.com"]

    def parse(self, response):
        item = CrawlerItemLoader(response=response)
        item.add_value('url', response.url)

        # Title
        item.add_xpath('title', '//div/h1/span/text()')
        if response.xpath('//div/h1/span/text()'):
            title = response.xpath('//div/h1/span/text()')
            item.add_value('title', title)

        yield item.load_item()
The first client run produces results, and I get the data I need whether I pass a single URL or multiple URLs.
On the second run, the spider opens but produces no results. This is what I get back before it stops:
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
I was able to print the URL or URLs from both the worker and the spider, and confirmed they were being passed on both the first (working) run and the second (non-working) run. I have spent two days on this and haven't gotten anywhere. I would appreciate any pointers.
Well, I decided to abandon Scrapy. I looked around a lot, and everyone kept pointing to the limitation of the Twisted reactor: it cannot be restarted once it has been stopped, so a CrawlerProcess can't simply be started again for each Gearman job. Rather than fighting the framework, I decided to build my own scraper, and it was very successful for what I needed. I am able to spin up multiple Gearman workers and use the scraper I built to scrape the data concurrently across a server farm.
If anyone is interested, I started with this simple article to build the scraper. I use a Gearman client to query the DB and send multiple URLs to a worker; the worker scrapes the URLs and runs an update query back to the DB. Success!! :)
http://docs.python-guide.org/en/latest/scenarios/scrape/
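As a rough illustration, the replacement worker follows the requests + lxml pattern from that article (a minimal sketch under my assumptions: the 'scrape' task name and the XPath are placeholders, and the actual DB update query is omitted):
import gearman
import json
import requests
from lxml import html

def task_listener_scrape(gearman_worker, gearman_job):
    data = json.loads(gearman_job.data)
    results = []
    for url in data['url_list']:
        # Fetch and parse each page without any Scrapy machinery
        page = requests.get(url)
        tree = html.fromstring(page.content)
        # Placeholder XPath; the real selectors depend on the target pages
        title = tree.xpath('//div/h1/span/text()')
        results.append({'url': url, 'title': title})
    # This is where the real worker runs its UPDATE query back to the DB
    return json.dumps(results)

gm_worker = gearman.GearmanWorker(['localhost:4730'])
gm_worker.register_task('scrape', task_listener_scrape)
gm_worker.work()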