Read cookies from Splash request


I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request.

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""
req = SplashRequest(
    url,
    self.parse_page,
    args={
        'wait': 0.5,
        'lua_source': script,
        'endpoint': 'execute'
    }
)

The script is an exact copy from the Splash documentation.

So I'm trying to access the cookies that are set on the webpage. When I'm not using Splash, the code below works as I expect it to, but not when using Splash.

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

When using Splash, this returns:

2017-01-03 12:12:37 [spider] DEBUG: Cookies: None

When I'm not using Splash, this code works and returns the cookies provided by the webpage.

The Splash documentation shows this code as an example:

def parse_result(self, response):
    # here response.body contains result HTML;
    # response.headers are filled with headers from last
    # web page loaded to Splash;
    # cookies from all responses and from JavaScript are collected
    # and put into Set-Cookie response header, so that Scrapy
    # can remember them.

I'm not sure whether I'm understanding this correctly, but I'd say I should be able to access the cookies in the same way as when I'm not using Splash.
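(As a side note, Scrapy stores repeated headers as a list, so a quick way to dump everything that was collected into Set-Cookie is getlist() rather than get(); this is only a debugging sketch, not something from the docs:)

def parse_page(self, response):
    # getlist() returns every Set-Cookie value on the response,
    # while get() only returns a single one.
    for value in response.headers.getlist('Set-Cookie'):
        self.logger.debug('Set-Cookie: %s', value)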

Middleware settings:

# Download middlewares 
DOWNLOADER_MIDDLEWARES = {
    # Use a random user agent on each request
    'crawling.middlewares.RandomUserAgentDownloaderMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    # Enable crawlera proxy
    'scrapy_crawlera.CrawleraMiddleware': 600,
    # Enable Splash to render javascript
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
}

So my question is: how do I access cookies while using a Splash request?

There are 2 answers

Mikhail Korobov:

You can set the SPLASH_COOKIES_DEBUG=True option to see all cookies which are being set. The current cookiejar, with all cookies merged, is available as response.cookiejar when scrapy-splash is configured correctly.
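A minimal sketch of how that could look (the spider callback name here is a placeholder). In settings.py:

SPLASH_COOKIES_DEBUG = True

and in the callback, the merged jar can be inspected roughly like this, assuming the scrapy-splash middlewares shown above are active:

def parse_page(self, response):
    # response.cookiejar is set by SplashCookiesMiddleware and holds the
    # merged cookies scrapy-splash tracks for this session.
    for cookie in response.cookiejar:
        self.logger.debug('Cookie: %s=%s', cookie.name, cookie.value)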

Using response.headers.get('Set-Cookie') is not robust because, in case of redirects (e.g. JS redirects), there could be several responses, and a cookie could be set in the first one while the script returns headers only for the last response.

I'm not sure if this is the problem you're having, though; the code is not an exact copy from the Splash docs. Here:

req = SplashRequest(
    url,
    self.parse_page,
    args={
        'wait': 0.5,
        'lua_source': script
    }
) 

you're sending the request to the /render.json endpoint; it doesn't execute Lua scripts. Use endpoint='execute' to fix that.
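A corrected version could look roughly like this (a sketch only; note that endpoint is a keyword argument of SplashRequest, not an entry inside args):

from scrapy_splash import SplashRequest

req = SplashRequest(
    url,
    self.parse_page,
    endpoint='execute',        # run the Lua script instead of the default render endpoint
    args={
        'lua_source': script,  # the script itself already calls splash:wait(0.5)
        'wait': 0.5,
    },
)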

Franz Gastring:

You are trying to read the data from the "static" headers sent by the server, but the JS code in the page can generate cookies too. This is why Splash provides splash:get_cookies(). To access the cookies on the response, you should use the table returned by the Lua script.

return {
   url = splash:url(),
   headers = last_response.headers,
   http_status = last_response.status,
   cookies = splash:get_cookies(),
   html = splash:html(),
}

Try changing this line

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

to

self.logger.debug('Cookies: %s', response.cookies)
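
For reference, with the execute endpoint scrapy-splash wraps the result in a SplashJsonResponse, so the table returned by the Lua script should also be reachable via response.data (a rough sketch, assuming default response handling):

def parse_page(self, response):
    # response.data holds the JSON-decoded dict returned by the Lua script,
    # so the 'cookies' key is whatever splash:get_cookies() produced.
    splash_cookies = response.data.get('cookies', [])
    self.logger.debug('Cookies from Splash: %s', splash_cookies)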