I'm creating my first Scrapy project with Splash, working with the test data from http://quotes.toscrape.com/js/
I want to store the quotes of each page as a separate file on disk (in the code below I first try to store the entire page). The code below worked when I was not using SplashRequest, but with the new code, nothing is stored on disk when I 'Run and debug' it in Visual Studio Code.
Also, self.log does not write to my Visual Studio Code terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?
Already checked here and here.
import scrapy
from scrapy_splash import SplashRequest


class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        # cycle through all available pages; build absolute URLs from the
        # pager hrefs (a bare Selector cannot be passed as a request URL)
        for href in response.css('ul.pager a::attr(href)').extract():
            yield SplashRequest(url=response.urljoin(href), callback=self.parse,
                                endpoint='render.html', args={'wait': 0.5})

        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
UPDATE 1
How I debug it:
- In Visual Studio Code, hit F5
- Select 'Python file'
Output tab is empty
Terminal tab contains:
PS C:\scrapy\tutorial> cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
PS C:\scrapy\tutorial>
Also, nothing is logged in my Docker container instance, which I thought was required for Splash to work in the first place.
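If Splash never sees any traffic, one common cause is that the scrapy-splash middlewares are not wired into the project settings; requests then bypass the container entirely. A minimal settings.py sketch for comparison (the localhost:8050 address is the scrapy-splash default, adjust it to wherever your Docker container listens):

```python
# settings.py -- minimal scrapy-splash wiring
# SPLASH_URL must point at the running Splash container
# (localhost:8050 is an assumption based on the default Docker setup)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Without these entries, SplashRequest falls back to a plain fetch and the JavaScript on the page is never executed.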
UPDATE 2
I ran scrapy crawl jsscraper
and a file 'quotes-js.html' was stored on disk. However, it contains the page source HTML without any JavaScript having been executed. I'm looking to execute the JS code on 'http://quotes.toscrape.com/js/' and store only the quote content. How can I do so?
WRITING OUTPUT TO A JSON FILE:
I have tried to solve your problem. Here is the working version of your code. I hope this is what you are trying to achieve.
UPDATE: The code above has been updated to scrape all pages and save the results in separate JSON files, from page-1 to page-10.
This will write the list of quotes from each page to a separate JSON file, as follows:
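The per-page writing step can be sketched as a small helper that parse() calls once the page's quotes are collected; the quotes-&lt;n&gt;.json filename pattern and the helper's name are assumptions for illustration:

```python
import json


def save_page_quotes(quotes, page_number):
    # Write one page's list of quote dicts to its own JSON file,
    # e.g. quotes-1.json ... quotes-10.json (filename pattern assumed).
    filename = 'quotes-%s.json' % page_number
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)
    return filename
```

Each yielded QuoteItem can be converted with dict(item) before being appended to the quotes list for its page.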