How do I scrape from this website using Scrapy and Splash?

I'm a newbie and I'm trying to scrape the href link of each place listed on this website. Then I want to follow each link and scrape its data, but I'm not even able to get the href links with this code. However, the same XPath selector works when I use it in the Scrapy shell.

import scrapy
from scrapy_splash import SplashRequest


class TestspiSpider(scrapy.Spider):
    name = 'testspi'
    allowed_domains = ["powersearch.jll.com"]
    start_urls = ["https://powersearch.jll.com/us-en/property/search"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        properties = response.xpath('//*[@class="ssr__container"]').extract()
        print(properties)
        print("HELLO WORLD")

When I run the code, I get an empty list. Here's the output:

2020-09-03 19:58:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-03 19:58:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-03 19:58:49 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)

2020-09-03 19:58:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://powersearch.jll.com/us-en/property/search via http://localhost:8050/render.html> (referer: None)
[]
HELLO WORLD
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-03 19:58:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 535,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 148739,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 9.802616,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 3, 14, 28, 59, 274213),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 51179520,
 'memusage/startup': 51179520,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2020, 9, 3, 14, 28, 49, 471597)}
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Spider closed (finished)

Please help me fix this.

1 Answer

Answered by Ryan:

In your case, I don't believe that Splash is required.

If you take a look at the webpage via your browser's developer tools, you will see that there is an API that loads the properties.
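
If you want to confirm the endpoint before writing a spider, you can fetch one page of it directly. Below is a minimal sketch using the requests library (an assumption on my part; the original answer uses Scrapy only). It checks that the response is JSON and carries the 'results' key the spider below relies on:

import requests  # third-party; pip install requests

# First page of the search API (the same URL template the spider below uses)
API_URL = (
    "https://powersearchapi.jll.com/api/search/properties/v2"
    "?queries%5B0%5D.type=1&queries%5B0%5D.term=United%20States%20of%20America"
    "&queries%5B0%5D.isStateOrCountry=true"
    "&options.siteOrganizationId=11111111-1111-1111-1111-111111111111"
    "&options.unitOfMeasurement=1&options.currencyCode=USD"
    "&options.page=1&options.perPage=24&options.sort=3&options.sortDir=1"
    "&options.searchMultiplier=1"
)

resp = requests.get(API_URL)
resp.raise_for_status()
data = resp.json()
print(len(data.get("results", [])))  # number of properties returned for page 1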

You could have a standard Scrapy spider call that API and request each page of properties:

import json
import scrapy


class TestspiSpider(scrapy.Spider):
    name = 'testspi'

    api_url = "https://powersearchapi.jll.com/api/search/properties/v2?queries%5B0%5D.type=1&queries%5B0%5D.term=United%20States%20of%20America&queries%5B0%5D.isStateOrCountry=true&options.siteOrganizationId=11111111-1111-1111-1111-111111111111&options.unitOfMeasurement=1&options.currencyCode=USD&options.page={page}&options.perPage=24&options.sort=3&options.sortDir=1&options.searchMultiplier=1"

    start_urls = [
        api_url.format(page=1)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        properties = data.get('results')

        if properties:
            # Yield each property record from the current page
            for prop in properties:
                yield prop

            # If there is no current page in meta, this is the first page
            current_page = response.meta.get('page') or 1
            next_page = current_page + 1
            yield scrapy.Request(
                self.api_url.format(page=next_page),
                meta={'page': next_page},
                callback=self.parse
            )
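
To then visit each property's page, as the question asks, you could follow a link from each JSON record. The sketch below is an illustration only: it assumes both classes live in the same module, and 'propertyUrl' is a hypothetical field name, since the exact JSON shape isn't shown here; inspect one API response to find the real key.

class TestspiDetailSpider(TestspiSpider):
    # Hypothetical extension of the spider above, reusing its api_url and
    # start_urls; follows each property's detail page.
    name = 'testspi_detail'

    def parse(self, response):
        data = json.loads(response.text)
        for prop in data.get('results') or []:
            detail_url = prop.get('propertyUrl')  # hypothetical key name
            if detail_url:
                yield response.follow(detail_url, callback=self.parse_property)
        # Pagination as in the parent spider could be repeated here as well

    def parse_property(self, response):
        # Extract whatever fields you need from the detail page
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').get(),
        }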