python scrapy login redirecting problems


I'm trying to use Scrapy to crawl a website, but I'm not able to log in to my account through Scrapy. Here is the spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from images.items import ImagesItem
from scrapy.http import Request
from scrapy.http import FormRequest
from loginform import fill_login_form
import requests
import os
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.shell import inspect_response

class ImageSpider(BaseSpider):
    counter = 0
    name = "images"
    start_urls = ['https://poshmark.com/login']
    f = open('poshmark.txt', 'wb')
    if not os.path.exists('./Image'):
        os.mkdir('./Image')

    def parse(self, response):
        return [FormRequest("https://www.poshmark.com/login",
                            formdata={
                                'login_form[username_email]': 'Oliver1234',
                                'login_form[password]': 'password'},
                            callback=self.real_parse)]

    def real_parse(self, response):
        print 'you are here'
        rq = []
        mainsites = response.xpath("//body[@class='two-col feed one-col']/div[@class='body-con']/div[@class='main-con clear-fix']/div[@class='right-col']/div[@id='tiles']/div[@class='listing-con shopping-tile masonry-brick']/a/@href").extract()
        for mainsite in mainsites:
            r = Request(mainsite, callback=self.get_image)
            rq.append(r)
        return rq

    def get_image(self, response):
        req = []
        sites = response.xpath("//body[@class='two-col small fixed']/div[@class='body-con']/div[@class='main-con']/div[@class='right-col']/div[@class='listing-wrapper']/div[@class='listing']/div[@class='img-con']/img/@src").extract()
        for site in sites:
            r = Request('http:' + site, callback=self.DownLload)
            req.append(r)
        return req

    def DownLload(self, response):
        str = response.url[0:-3]
        self.counter = self.counter + 1
        str = str.split('/')
        print '----------------Image Get----------------', self.counter, str[-1], 'jpg'
        imgfile = open('./Image/' + str[-1] + "jpg", 'wb')
        imgfile.write(response.body)
        imgfile.close()

When I run the spider, I get the following command window output:

C:\Python27\Scripts\tutorial\images>scrapy crawl images
C:\Python27\Scripts\tutorial\images\images\spiders\images_spider.py:14: ScrapyDeprecationWarning: images.spiders.images_spider.ImageSpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
  class ImageSpider(BaseSpider):
2015-06-09 23:43:29-0400 [scrapy] INFO: Scrapy 0.24.6 started (bot: images)
2015-06-09 23:43:29-0400 [scrapy] INFO: Optional features available: ssl, http11
2015-06-09 23:43:29-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'images.spiders', 'SPIDER_MODULES': ['images.spiders'], 'BOT_NAME': 'images'}
2015-06-09 23:43:29-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-09 23:43:30-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-09 23:43:30-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-09 23:43:30-0400 [scrapy] INFO: Enabled item pipelines:
2015-06-09 23:43:30-0400 [images] INFO: Spider opened
2015-06-09 23:43:30-0400 [images] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-09 23:43:30-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-09 23:43:30-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-09 23:43:33-0400 [images] DEBUG: Crawled (200) <GET https://poshmark.com/login> (referer: None)
2015-06-09 23:43:35-0400 [images] DEBUG: Redirecting (302) to <GET https://www.poshmark.com/feed> from <POST https://www.poshmark.com/login>
2015-06-09 23:43:35-0400 [images] DEBUG: Redirecting (301) to <GET https://poshmark.com/feed> from <GET https://www.poshmark.com/feed>
2015-06-09 23:43:36-0400 [images] DEBUG: Redirecting (302) to <GET https://poshmark.com/login?pmrd%5Burl%5D=%2Ffeed> from <GET https://poshmark.com/feed>
2015-06-09 23:43:36-0400 [images] DEBUG: Redirecting (301) to <GET https://poshmark.com/login> from <GET https://poshmark.com/login?pmrd%5Burl%5D=%2Ffeed>
2015-06-09 23:43:37-0400 [images] DEBUG: Crawled (200) <GET https://poshmark.com/login> (referer: https://poshmark.com/login)
you are here
2015-06-09 23:43:37-0400 [images] INFO: Closing spider (finished)
2015-06-09 23:43:37-0400 [images] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 4213,
         'downloader/request_count': 6,
         'downloader/request_method_count/GET': 5,
         'downloader/request_method_count/POST': 1,
         'downloader/response_bytes': 9535,
         'downloader/response_count': 6,
         'downloader/response_status_count/200': 2,
         'downloader/response_status_count/301': 2,
         'downloader/response_status_count/302': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 6, 10, 3, 43, 37, 213000),
         'log_count/DEBUG': 8,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 6,
         'scheduler/dequeued/memory': 6,
         'scheduler/enqueued': 6,
         'scheduler/enqueued/memory': 6,
         'start_time': datetime.datetime(2015, 6, 10, 3, 43, 30, 788000)}
2015-06-09 23:43:37-0400 [images] INFO: Spider closed (finished)

You can see that the POST to /login is first redirected to /feed, which looks like a successful login, but the crawl is then redirected back to /login in the end. Any ideas about what might be causing this?
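In case it is relevant: one way to see what happens to the session cookie across these redirects is Scrapy's COOKIES_DEBUG setting, which makes the CookiesMiddleware log every cookie sent and received (the log above was captured without it):

# settings.py -- debugging aid only: with COOKIES_DEBUG enabled, the
# CookiesMiddleware logs the Cookie / Set-Cookie headers for every
# request and response, so you can see where the session is lost.
COOKIES_DEBUG = True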

There is 1 answer:

vicg (BEST ANSWER)

When you log into a website, it stores some sort of token in the user's session (exactly what depends on the authentication method). The problem you are having is that while you are being authenticated properly, your session data (the way the browser tells the server that you are logged in and that you are who you say you are) isn't being saved.

The posters in these threads seem to have managed to do what you are trying to do, here:

Crawling with an authenticated session in Scrapy

and here:

Using Scrapy with authenticated (logged in) user session
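
Along the lines of those threads, here is a minimal sketch of the usual Scrapy login pattern (the form field names and credentials are copied from the question; the spider and callback names are only illustrative). FormRequest.from_response builds the login POST from the page that was just downloaded, so it submits to the form's own action URL and carries over any hidden <input> fields (such as a CSRF/authenticity token, if the site uses one), while the CookiesMiddleware, which is enabled by default, keeps the session cookie across the redirects:

from scrapy.spider import Spider
from scrapy.http import FormRequest

class LoginSketchSpider(Spider):
    # Illustrative spider, not the original one from the question.
    name = "login_sketch"
    start_urls = ['https://poshmark.com/login']

    def parse(self, response):
        # Build the POST from the downloaded login page: from_response picks up
        # the form's action URL and its hidden fields automatically, and the
        # cookies set by this page are re-sent on the follow-up requests.
        return FormRequest.from_response(
            response,
            formdata={'login_form[username_email]': 'Oliver1234',
                      'login_form[password]': 'password'},
            callback=self.after_login)

    def after_login(self, response):
        # If the final URL is still the login page, the login did not stick.
        if 'login' in response.url:
            self.log("Login failed, ended up at %s" % response.url)
            return
        # ...continue here with the /feed and image parsing from the original spider...

Note that from_response targets the first form on the page by default; if the login page has more than one form, pass formname or formnumber to pick the right one. If after_login still ends up on the login page, the next things to check are whether the form has additional hidden fields that need specific values (inspect the form in a browser) and whether the session cookie is being dropped somewhere in the redirect chain, which the COOKIES_DEBUG output will show.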