pymongo.errors.ConnectionFailure: timed out from an Ubuntu EC2 instance running Scrapyd


So... I'm running Scrapyd on my Ubuntu EC2 instance after following this post: http://www.dataisbeautiful.io/deploying-scrapy-ec2/

However, I can't seem to get pymongo to connect to my MongoLab database; the Scrapyd logs on the EC2 instance say

pymongo.errors.ConnectionFailure: timed out

I'm a real noob when it comes to back-end stuff, so I don't really have any idea what could be causing this. When I run Scrapyd locally, it works totally fine and saves the scraped data to my MongoLab db. On the EC2 instance I can reach the Scrapyd web UI at the instance's address on port 6800 (the equivalent of localhost:6800), but that's about it. Curling

curl http://aws-ec2-link:6800/schedule.json -d project=sportslab_scrape -d spider=max -d max_url="http://www.maxpreps.com/high-schools/de-la-salle-spartans-(concord,ca)/football/stats.htm"

gives back 'status': 'okay' and I can see the job appear, but no items are produced and the log only shows:

2014-11-17 02:20:13+0000 [scrapy] INFO: Scrapy 0.24.4 started (bot: sportslab_scrape_outer)
2014-11-17 02:20:13+0000 [scrapy] INFO: Optional features available: ssl, http11
2014-11-17 02:20:13+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sportslab_scrape.spiders', 'SPIDER_MODULES': ['sportslab_scrape.spiders'], 'FEED_URI': 'items/sportslab_scrape/max/4299afa26e0011e4a543060f585a893f.jl', 'LOG_FILE': 'logs/sportslab_scrape/max/4299afa26e0011e4a543060f585a893f.log', 'BOT_NAME': 'sportslab_scrape_outer'}
2014-11-17 02:20:13+0000 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-17 02:20:13+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-17 02:20:13+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
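
For reference, here's the same schedule call plus a job-status check done from Python instead of curl. This is just a rough sketch using the requests library against Scrapyd's documented schedule.json and listjobs.json endpoints (the host is the same placeholder as in the curl command above):

import requests

SCRAPYD = "http://aws-ec2-link:6800"  # placeholder EC2 address, as in the curl call

# Schedule the spider, mirroring the curl command
resp = requests.post(SCRAPYD + "/schedule.json", data={
    "project": "sportslab_scrape",
    "spider": "max",
    "max_url": "http://www.maxpreps.com/high-schools/de-la-salle-spartans-(concord,ca)/football/stats.htm",
})
print(resp.json())  # expect a status of "ok" and a jobid

# See whether the job is pending, running, or finished
jobs = requests.get(SCRAPYD + "/listjobs.json", params={"project": "sportslab_scrape"})
print(jobs.json())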

Anyone got some helpful insights for my issue? Thanks!

Edit: connection code added. settings.py:

MONGODB_HOST = 'mongodb://user:[email protected]:38839/sportslab_mongodb' 
MONGODB_PORT = 38839 # Change in prod
MONGODB_DATABASE = "sportslab_mongodb" # Change in prod
MONGODB_COLLECTION = "sportslab"

Scrapy's Pipeline.py:

from pymongo import Connection  # pymongo 2.x API; replaced by MongoClient in pymongo 3+
from scrapy.conf import settings

class MongoDBPipeline(object):
    def __init__(self):
        # MONGODB_HOST is a full mongodb:// URI, so it already carries
        # the credentials, host and port; Connection accepts it directly.
        connection = Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DATABASE']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Insert every scraped item as a plain dict.
        self.collection.insert(dict(item))
        return item
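
A quick way to check whether the timeout is a networking problem rather than a Scrapy problem is to try the connection by hand from the EC2 box with a short timeout. Here's a minimal sketch, assuming pymongo 3+ (MongoClient rather than the old Connection class; the URI below is a placeholder for the real MONGODB_HOST value):

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Placeholder URI; substitute the real MONGODB_HOST from settings.py.
# A 5-second timeout makes a blocked port fail fast instead of hanging.
uri = "mongodb://user:password@<mongolab-host>:38839/sportslab_mongodb"
client = MongoClient(uri, serverSelectionTimeoutMS=5000)

try:
    client.admin.command("ping")  # forces an actual round trip to the server
    print("connected")
except ConnectionFailure as exc:
    print("cannot reach MongoDB:", exc)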

1 Answer

pyramidface (accepted answer):

I solved the issue. Initially, I had set up my EC2 instance's security group outbound rules as:

Outbound
Type: HTTP, Protocol: TCP, Port Range: 80, Destination: 0.0.0.0/0
Type: Custom, Protocol: TCP, Port Range: 6800, Destination: 0.0.0.0/0
Type: HTTPS, Protocol: TCP, Port Range: 443, Destination: 0.0.0.0/0

However, this wasn't enough: I also needed a custom TCP rule for the actual port of the MongoLab db I was connecting to, which looks like this:

Type: Custom, Protocol: TCP, Port Range: 38839, Destination: 0.0.0.0/0
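
For anyone scripting this instead of clicking through the console, the same egress rule can be added via the AWS SDK. A rough sketch with boto3 (the security group ID below is a placeholder):

import boto3

ec2 = boto3.client("ec2")

# Open outbound TCP on the MongoLab port (38839) to anywhere,
# matching the console rule above.
ec2.authorize_security_group_egress(
    GroupId="sg-0123456789abcdef0",  # placeholder: your instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 38839,
        "ToPort": 38839,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)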