Storing responses as files using Scrapy Splash


I'm creating my first Scrapy project with Splash, working with the test data from http://quotes.toscrape.com/js/. I want to store the quotes of each page as a separate file on disk (in the code below I first try to store the entire page). The code below worked when I was not using SplashRequest, but with the new code nothing is stored on disk when I 'Run and debug' it in Visual Studio Code. Also, self.log does not write to my Visual Studio Code Terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?

Already checked here and here.

import scrapy
from scrapy_splash import SplashRequest

class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()   

class MySpider(scrapy.Spider):
    name = "jsscraper"

    
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):            
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        #cycle through all available pages
        for a in response.css('ul.pager a'):
            yield SplashRequest(url=a,callback=self.parse,endpoint='render.html',args={ 'wait': 0.5 })

       
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

UPDATE 1

How I debug it:

  • In Visual Studio Code, hit F5
  • Select 'Python file'

Output tab is empty

Terminal tab contains:

PS C:\scrapy\tutorial>  cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
PS C:\scrapy\tutorial> 

Also, nothing is logged in my Docker container instance, which I thought was required for Splash to work in the first place.

UPDATE 2

I ran scrapy crawl jsscraper and a file 'quotes-js.html' is stored on disk. However, it contains the page source HTML without any JavaScript code executed. I'm looking to execute the JS code on 'http://quotes.toscrape.com/js/' and store only the quote content. How can I do so?


There are 2 answers

Answer by Arslan Arif (best answer)

WRITING OUTPUT TO A JSON FILE:

I have tried to solve your problem. Here is the working version of your code. I hope this is what you are trying to achieve.

import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/page/"+str(i+1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5}
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

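        # Everything after "page/" in the URL is the page number;
        # strip a trailing slash so it cannot end up in the filename
        page = response.url[response.url.index("page/")+5:].strip("/")
        print("page=", page)
        filename = 'quotes-%s.json' % page
        # Write this page's quotes to its own JSON file
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))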

UPDATE: The code above has been updated to scrape all pages (1 to 10) and save the results of each page in a separate JSON file.

This writes the list of quotes from each page to its own JSON file, as follows:

{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
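
The filename and file-writing step can also be pulled out of the spider so it is easier to test. The following is only a sketch: the helper names `page_from_url` and `save_quotes` are hypothetical (not part of Scrapy or of the code above), and `ensure_ascii=False` is an optional choice that keeps the curly quotes readable in the file instead of writing `\u201c` escapes.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def page_from_url(url):
    """Return the last non-empty path segment, e.g. '3' for .../js/page/3/."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else "index"


def save_quotes(quotes, url, out_dir="."):
    """Write one page of quotes to quotes-<page>.json and return the path."""
    path = Path(out_dir) / f"quotes-{page_from_url(url)}.json"
    # ensure_ascii=False writes the characters themselves, not \uXXXX escapes,
    # so the file must be written with an explicit UTF-8 encoding
    text = json.dumps({"quotes": quotes}, indent=4, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return path
```

Inside `parse`, this would be called as `save_quotes(quotes["quotes"], response.url)`.
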
Answer by pygeek

Problem

The JavaScript on the website you want to scrape isn't being executed.

Solution

Increase the SplashRequest wait time to give the JavaScript time to execute.

Example

yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={ 'wait': 0.5 }
)
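
If the saved HTML is still the unrendered page source even with a wait, it may be that the requests are not going through Splash at all, which would also fit a Docker container that logs nothing. Below is a sketch of the settings.py entries described in the scrapy-splash README. The `SPLASH_URL` value is an assumption (Splash running locally in Docker on port 8050); adjust it to your setup.

```python
# settings.py (sketch; SPLASH_URL is an assumed local Docker address)
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep response compression handling after it
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments in the request queue
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and the HTTP cache aware of Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these in place, running `scrapy crawl jsscraper` from the project directory should produce request log lines in the Splash container.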