Scrape an aspx site using scrapy

I am trying to scrape the site linked below, iterating through each page to retrieve both the job titles and the dates each job was posted. I cannot seem to scrape more than the first page. The site is ASPX, so it requires sending postbacks to the server to retrieve each page of results. I have attempted to simulate these postbacks with Scrapy but have not had any success. I understand there are other tools that can do this (like Selenium), but I am hoping to find a way to do it with Scrapy. This resource has been some help: https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition. However, the problem described there is slightly different from mine, and I am tired of banging my head against the wall.

So what I have done so far is come up with a general idea of how I want the scraper to work:

  1. Scrape the first page
  2. Find the arrow button that goes to the next page
  3. Submit a FormRequest that simulates the postback the browser would send to load the next page
  4. Scrape the returned information
  5. Repeat

My biggest issue is that I am not very familiar with ASP.NET pages. I do know that the browser must do a postback for each page, but I cannot seem to submit the form data in the right way to iterate through each page of results.
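From what I have read, the usual way to simulate a WebForms postback in Scrapy looks roughly like the sketch below. FormRequest.from_response() copies the hidden state fields (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION) from the page's form, so only the fields that say which control was "clicked" have to be filled in by hand. Note that the start URL, the pager selector and the __EVENTTARGET value here are placeholders, not the real names from this job board; those would have to be copied from the POST request the browser sends when the next-page arrow is clicked.

import scrapy


class AspxPagerSketch(scrapy.Spider):
    # Generic sketch only: the start URL, the pager selector and the
    # __EVENTTARGET value below are placeholders, not taken from the real site.
    name = 'aspx_pager_sketch'
    start_urls = ['https://example.com/ListJobs.aspx']

    def parse(self, response):
        # ...extract the items for the current page here...

        # Only post back while the page still shows a "next page" control.
        if response.css('a#NextPageLink'):  # placeholder selector
            # from_response() re-submits the page's form, including the hidden
            # WebForms state fields, so only the "clicked control" fields need
            # to be supplied explicitly.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$Pager$NextButton',  # placeholder
                    '__EVENTARGUMENT': '',
                },
                callback=self.parse,
            )

Whether this site uses the standard __EVENTTARGET mechanism at all is part of what I am unsure about; the form on this page appears to use its own __Next and __PageNumber fields instead, as in my code below.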

Here is my current code (it has changed countless times). I recognize that it will not work as is; the second portion of parse, the pagination logic, is clearly not thought out well. I am just hoping it might be a good starting point for someone to look at and help me find the answer.

import scrapy, re
from ..items import UnionItems

class Union(scrapy.Spider):
    name = "union"
    custom_settings = {
        'ITEM_PIPELINES': {
            'tutorial.pipelines.UnionHospital': 300,
            'tutorial.pipelines.MongoDBPipeline': 300,
        },
        # 'DEFAULT_REQUEST_HEADERS': {
        #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        #     'Accept-Language': 'en',
        #     'Referer': 'https://re21.ultipro.com/UNI1029/JobBoard/ListJobs.aspx?__SVRTRID=05EB3802-DC15-42A8-9843-188F8787D187',
        # },
    }

    start_urls = [
        'https://re21.ultipro.com/UNI1029/JobBoard/ListJobs.aspx?__SVRTRID=FFDDC66F-AA37-4484-A868-96DF06DA013C',
    ]


    def parse(self, response):
        items = UnionItems()
        # Job titles and posted dates for every job listed on the current page.
        job_title = response.css('.highlightRow:nth-child(3) .LinkSmall , .Cell:nth-child(3) .LinkSmall').css('::text').extract()
        dates_posted = response.css('.highlightRow:nth-child(1) .LinkSmall , .Cell:nth-child(1) .LinkSmall').css('::text').extract()
        items['job_title'] = job_title
        items['date_posted'] = dates_posted
        yield items

        # If the next-page arrow is present, post back to the same URL to ask
        # for the following page. from_response() re-submits the page's form,
        # with __Next, __PageNumber and __VIEWSTATE overridden explicitly, and
        # the page number is carried along in the request's meta. This is the
        # part that still only ever returns the first page.
        next_button = response.css('input#__Next::attr(value)').extract_first()
        if next_button:
            next_page = response.meta.get('page_number', 1) + 1
            yield scrapy.FormRequest.from_response(
                response,
                method="POST",
                formdata={
                    '__Next': next_button,
                    '__PageNumber': str(next_page),
                    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                },
                callback=self.parse,
                meta={'page_number': next_page},
            )

This code is successful in scraping the first page of jobs and posted dates, but after that I cannot figure out how to retrieve the next few pages.
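One generic way I know to debug this kind of thing (not specific to this site) is to open whatever the server actually sends back in a local browser, so it is obvious whether the second request returned page 2 or just the first page again. Scrapy ships a small helper for that; a minimal sketch of using it in the spider above:

import scrapy
from scrapy.utils.response import open_in_browser


class Union(scrapy.Spider):
    name = 'union'
    # ...start_urls and settings as above...

    def parse(self, response):
        # Write the returned HTML to a temporary file and open it in the
        # default browser, to check by eye whether this response is really
        # the next page of jobs or the first page served again.
        open_in_browser(response)
        # ...item extraction and the pagination FormRequest continue as above...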
