I have a Scrapy project that uses a middleware installed via pip, specifically scrapy-random-useragent.
Settings file:
# -*- coding: utf-8 -*-
# Scrapy settings for batdongsan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'batdongsan'
SPIDER_MODULES = ['batdongsan.spiders']
NEWSPIDER_MODULE = 'batdongsan.spiders'
FEED_EXPORT_ENCODING = 'utf-8'  # keep JSON output as human-readable UTF-8
CLOSESPIDER_PAGECOUNT = 10  # limit the number of pages crawled
LOG_LEVEL = 'INFO'  # write fewer log messages
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'batdongsan.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400
}
USER_AGENT_LIST = "agents.txt"
The Scrapy project runs fine on my machine.
I deployed it to Scrapinghub using a linked GitHub project.
I got this error in the logs on Scrapinghub:
File "/usr/local/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 168, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 172, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1445, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 77, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 102, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
exceptions.ImportError: No module named random_useragent
It is clear that the problem is "No module named random_useragent".
But I don't know how to install that module via pip on Scrapinghub.
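The traceback ends with load_object calling import_module on the middleware's module, so the failure can be reproduced outside Scrapy. Below is a minimal sketch of that check; missing_middleware_modules is a hypothetical helper written for illustration (it is not part of Scrapy), and it assumes that entries set to None are disabled and therefore never imported.

```python
import importlib


def missing_middleware_modules(middlewares):
    """Return the modules from a DOWNLOADER_MIDDLEWARES-style dict that cannot be imported.

    Hypothetical helper for illustration; it only mimics the import step
    that fails in the traceback above.
    """
    missing = []
    for class_path, order in middlewares.items():
        if order is None:
            # Assumption: a value of None disables the middleware, so it is not loaded.
            continue
        # 'random_useragent.RandomUserAgentMiddleware' -> 'random_useragent'
        module_name = class_path.rsplit(".", 1)[0]
        try:
            importlib.import_module(module_name)
        except ImportError:
            missing.append(module_name)
    return sorted(missing)


if __name__ == "__main__":
    settings = {
        "scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware": None,
        "random_useragent.RandomUserAgentMiddleware": 400,
    }
    print(missing_middleware_modules(settings))
```

Run in an environment where the package is not installed, this should list random_useragent, which matches what the Scrapinghub log reports.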
When linking GitHub repositories with Python dependencies on Scrapinghub, you'll need to have 2 files at the root of your repository (that is, at the same level as your scrapy.cfg file):
scrapinghub.yml
requirements.txt
They should contain the same things as detailed in the shub deploy section of their docs: scrapinghub.yml points to the requirements file, and requirements.txt lists the pip packages to install (here, scrapy-random-useragent).
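As a rough sketch of what those two files might look like, the Python snippet below writes them into a repository root. The project ID 12345 is a placeholder, write_deploy_files is a hypothetical helper, and the requirements/file layout of scrapinghub.yml is my reading of the shub docs, so check it against the version of shub you use.

```python
import os
import tempfile

# Placeholder project ID (assumption); use the numeric ID of your own project.
PROJECT_ID = 12345

# Assumed layout of scrapinghub.yml, following the shub deploy documentation.
SCRAPINGHUB_YML = """\
projects:
  default: {project_id}
requirements:
  file: requirements.txt
"""

# One pip package per line; this is the package that provides random_useragent.
REQUIREMENTS_TXT = "scrapy-random-useragent\n"


def write_deploy_files(repo_root, project_id=PROJECT_ID):
    """Write scrapinghub.yml and requirements.txt next to scrapy.cfg.

    Hypothetical helper for illustration; returns a dict of file name -> path.
    """
    files = {
        "scrapinghub.yml": SCRAPINGHUB_YML.format(project_id=project_id),
        "requirements.txt": REQUIREMENTS_TXT,
    }
    paths = {}
    for name, content in files.items():
        path = os.path.join(repo_root, name)
        with open(path, "w") as handle:
            handle.write(content)
        paths[name] = path
    return paths


if __name__ == "__main__":
    # Demo only: write into a temporary directory instead of a real repository.
    demo_root = tempfile.mkdtemp()
    for name, path in sorted(write_deploy_files(demo_root).items()):
        print("wrote", path)
```

Once both files are committed and pushed, the next deploy from the linked GitHub repository should install the listed packages before the spider starts.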