I have the following Python code that uses an APScheduler TwistedScheduler cron job to start the spider.
Using one spider was not a problem and worked great. However, using two spiders results in the error: twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.
I did find a related question that uses CrawlerRunner as the solution. However, I'm using a TwistedScheduler object, so I do not know how to get this working with multiple cron jobs (multiple add_job() calls).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler
from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider
process = CrawlerProcess(get_project_settings())
# Start the crawler in a scheduler
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Use cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)
scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use cron job; runs the full spider every week on Monday, Thursday and Saturday at 04:35
scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()
process.start(False)
I'm now using the BlockingScheduler in combination with Process and CrawlerRunner, as well as enabling logging via configure_logging(). The script at least doesn't exit directly (it blocks). I now get the following output as expected:
Since we are using BlockingScheduler, the scheduler will not exit directly: start() is a blocking call, meaning it lets the scheduler run the jobs indefinitely.
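For reference, a minimal sketch of the approach described above: each scheduled job spawns a separate child process via multiprocessing.Process, and the child runs the crawl with CrawlerRunner and its own Twisted reactor. Because every run gets a fresh process, the reactor is never installed twice. This assumes the same project layout as the question (myprojectscraper with MyHomepageSpider and MySpider); the helper names run_spider and crawl_in_process are my own, not from any library.

```python
from multiprocessing import Process

from apscheduler.schedulers.blocking import BlockingScheduler
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

def run_spider(spider_cls):
    # Runs inside the child process, so this reactor is fresh every time.
    from twisted.internet import reactor
    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    deferred = runner.crawl(spider_cls)
    deferred.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl ends
    reactor.run()  # blocks until the crawl finishes, then the process exits

def crawl_in_process(spider_cls):
    # Each job run gets its own process and therefore its own reactor.
    p = Process(target=run_spider, args=(spider_cls,))
    p.start()
    p.join()

scheduler = BlockingScheduler(timezone="Europe/Amsterdam")
scheduler.add_job(crawl_in_process, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
scheduler.add_job(crawl_in_process, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()  # blocking call; runs the jobs until interrupted
```

Note that scheduler.start() here blocks the main thread itself, so no separate process.start() call is needed as with CrawlerProcess.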