Scrapy - Build URLs Dynamically Based on HTTP Status Code?


I'm just getting started with Scrapy. I went through the tutorial, but I'm running into an issue that either I can't find the answer to in the tutorial and/or docs, or I've read the answer multiple times now and I'm just not understanding it properly...

Scenario:

Let's say I have exactly one website that I would like to crawl. Content is rendered dynamically based on query params passed in the URL. I will need to scrape 3 "sets" of data based on the URL param "category".

All the information I need can be grabbed from a common base URL like this:

"http://shop.somesite.com/browse/?product_type=instruments"

And the URLs for each category look like so:

"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums"

"http://shop.somesite.com/browse/?q=&product_type=instruments&category=keyboards"

"http://shop.somesite.com/browse/?q=&product_type=instruments&category=guitars"

The one caveat here is that the site only loads 30 results per initial request. If the user wants to view more, they have to click the "Load More Results..." button at the bottom. After investigating this a bit: during the initial load of the page, only the request for the top 30 is made (which makes sense), and after clicking the "Load More..." button, the URL is updated with "pagex=2" appended and the container refreshes with 30 more results. After this, the button goes away, and as the user continues to scroll down the page, subsequent requests are made to the server to get the next 30 results: the "pagex" value is incremented by one, the container is refreshed with the results appended, rinse and repeat.

I'm not exactly sure how to handle pagination on sites, but the simplest solution I came up with is simply finding out what the max "pagex" number is for each category, and just setting the URLs to that number for starters.

For example, if you pass this URL in the browser:

"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex=22"

An HTTP 200 response code is received and all results are rendered to the page. Great! That gives me what I need!
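In code, that first pass would amount to something like this (reusing the category_url sketch from above; the numbers for keyboards and guitars are just placeholders):

# Last known "good" max page per category
LAST_KNOWN_MAX = {"drums": 22, "keyboards": 18, "guitars": 25}

start_urls = [
    category_url(category) + f"&pagex={max_page}"
    for category, max_page in LAST_KNOWN_MAX.items()
]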

But say next week or so 50 more items are added, so now the max is "...pagex=24" and I wouldn't get all the latest.

Or if 50 items are removed and the new max is "...pagex=20", I will get a 404 response when requesting page 22.

I would like to send a test request with the last known "good" max page number and, based on the HTTP response received, use that to decide what the URL will be.

So, before I start any crawling, I would like to add 1 to "pagex" and check for a 404. If I get a 404, I know I'm still good; if I get a 200, I need to keep adding 1 until I get a 404, so I know where the max is (or decrease it if needed).
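To make the idea concrete, this is roughly the check I have in mind, written here with the plain requests library (an untested sketch; page_url and the starting numbers are placeholders):

import requests

def page_url(category, page):
    # Placeholder builder mirroring the URL pattern above
    return ("http://shop.somesite.com/browse/?q=&product_type=instruments"
            f"&category={category}&pagex={page}")

def find_max_page(category, last_known_max):
    """Return the highest pagex value that still gives an HTTP 200."""
    page = last_known_max
    if requests.get(page_url(category, page + 1)).status_code == 404:
        # Max hasn't grown, but it may have shrunk: walk down past any 404s
        while page > 1 and requests.get(page_url(category, page)).status_code == 404:
            page -= 1
        return page
    # Max has grown: walk up until one past the current page returns 404
    page += 1
    while requests.get(page_url(category, page + 1)).status_code == 200:
        page += 1
    return page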

I can't seem to figure out if I can do this using Scrapy, or if I have to use a different module to run this check first. I tried adding simple checks for testing purposes in the "parse" and "start_requests" methods, with no luck. start_requests doesn't seem to be able to handle responses, and parse can check the response code but will not update the URL as instructed.

I'm sure it's my poor coding skills (I'm still new to all this), but I can't seem to find a viable solution...

Any thoughts or ideas are very much appreciated!


1 Answer

eLRuLL (accepted answer):

You can configure in Scrapy which HTTP statuses to handle, so you can make decisions, for example in the parse method, according to response.status. Check how to handle statuses (handle_httpstatus_list) in the documentation. Example:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]  # 404 responses now reach your callbacks
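Building on that, here is a rough sketch of how the whole probe could look in a single spider. handle_httpstatus_list, response.status, and cb_kwargs are real Scrapy features; the spider name, the page_url helper, and the stored max pages are placeholders, and none of this is tested against the real site:

import scrapy

class MaxPageSpider(scrapy.Spider):
    name = "max_page_probe"
    handle_httpstatus_list = [404]  # let 404 responses reach the callbacks
    last_known_max = {"drums": 22, "keyboards": 18, "guitars": 25}  # placeholders

    def page_url(self, category, page):
        return ("http://shop.somesite.com/browse/?q=&product_type=instruments"
                f"&category={category}&pagex={page}")

    def start_requests(self):
        for category, max_page in self.last_known_max.items():
            # Probe one page past the last known max
            yield scrapy.Request(
                self.page_url(category, max_page + 1),
                callback=self.check_page,
                cb_kwargs={"category": category, "page": max_page + 1},
            )

    def check_page(self, response, category, page):
        if response.status == 200:
            # The max has grown: remember it and probe the next page
            self.last_known_max[category] = page
            yield scrapy.Request(
                self.page_url(category, page + 1),
                callback=self.check_page,
                cb_kwargs={"category": category, "page": page + 1},
            )
        else:
            # 404: the previous page is the real max; request it for the actual scrape.
            # dont_filter skips the dupe filter in case that page was already probed.
            yield scrapy.Request(
                self.page_url(category, page - 1),
                callback=self.parse_results,
                cb_kwargs={"category": category},
                dont_filter=True,
            )

    def parse_results(self, response, category):
        # Real item extraction goes here
        pass

The shrinking case (the old max itself now returning 404) would need a matching downward walk; it's omitted to keep the sketch short.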