How do I scrape from this website using Scrapy and Splash?

I'm a newbie and I'm trying to scrape the href link of each place listed on this website. Then I want to follow each link and scrape its data, but I'm not even able to get the href links with this code. However, the same XPath selector works when I use it in the Scrapy shell.

import scrapy
from scrapy_splash import SplashRequest


class TestspiSpider(scrapy.Spider):
    name = 'testspi'
    allowed_domains = ["powersearch.jll.com"]
    start_urls = ["https://powersearch.jll.com/us-en/property/search"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        properties = response.xpath('//*[@class="ssr__container"]').extract()
        print(properties)
        print("HELLO WORLD")

When I run the code, I get an empty list. Here's the output:

2020-09-03 19:58:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-03 19:58:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-03 19:58:49 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)

2020-09-03 19:58:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://powersearch.jll.com/us-en/property/search via http://localhost:8050/render.html> (referer: None)
[]
HELLO WORLD
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-03 19:58:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 535,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 148739,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 9.802616,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 3, 14, 28, 59, 274213),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 51179520,
 'memusage/startup': 51179520,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2020, 9, 3, 14, 28, 49, 471597)}
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Spider closed (finished)

Please help me fix this.

1 Answer

Answered by Ryan:

In your case, I don't believe that Splash is required.

If you take a look at the webpage via your browser's developer tools, you will see that there is an API that loads the properties.
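
If you want to confirm the endpoint before writing a spider, you can fetch one page of it directly. Below is a minimal sketch using the requests library (an assumption on my part; the original answer uses Scrapy only). It checks that the response is JSON and carries the 'results' key the spider below relies on:

import requests  # third-party; pip install requests

# First page of the search API (the same URL template the spider below uses)
API_URL = (
    "https://powersearchapi.jll.com/api/search/properties/v2"
    "?queries%5B0%5D.type=1&queries%5B0%5D.term=United%20States%20of%20America"
    "&queries%5B0%5D.isStateOrCountry=true"
    "&options.siteOrganizationId=11111111-1111-1111-1111-111111111111"
    "&options.unitOfMeasurement=1&options.currencyCode=USD"
    "&options.page=1&options.perPage=24&options.sort=3&options.sortDir=1"
    "&options.searchMultiplier=1"
)

resp = requests.get(API_URL)
resp.raise_for_status()
data = resp.json()
print(len(data.get("results", [])))  # number of properties returned for page 1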

You could have a standard Scrapy spider call that API and request each page of properties:

import json
import scrapy


class TestspiSpider(scrapy.Spider):
    name = 'testspi'

    api_url = "https://powersearchapi.jll.com/api/search/properties/v2?queries%5B0%5D.type=1&queries%5B0%5D.term=United%20States%20of%20America&queries%5B0%5D.isStateOrCountry=true&options.siteOrganizationId=11111111-1111-1111-1111-111111111111&options.unitOfMeasurement=1&options.currencyCode=USD&options.page={page}&options.perPage=24&options.sort=3&options.sortDir=1&options.searchMultiplier=1"

    start_urls = [
        api_url.format(page=1)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        properties = data.get('results')

        if properties:
            # Yield each property record from the current page
            for prop in properties:
                yield prop

            # If there is no current page in meta, this is the first page
            current_page = response.meta.get('page') or 1
            next_page = current_page + 1
            yield scrapy.Request(
                self.api_url.format(page=next_page),
                meta={'page': next_page},
                callback=self.parse
            )
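
To then visit each property's page, as the question asks, you could follow a link from each JSON record. The sketch below is an illustration only: it assumes both classes live in the same module, and 'propertyUrl' is a hypothetical field name, since the exact JSON shape isn't shown here; inspect one API response to find the real key.

class TestspiDetailSpider(TestspiSpider):
    # Hypothetical extension of the spider above, reusing its api_url and
    # start_urls; follows each property's detail page.
    name = 'testspi_detail'

    def parse(self, response):
        data = json.loads(response.text)
        for prop in data.get('results') or []:
            detail_url = prop.get('propertyUrl')  # hypothetical key name
            if detail_url:
                yield response.follow(detail_url, callback=self.parse_property)
        # Pagination as in the parent spider could be repeated here as well

    def parse_property(self, response):
        # Extract whatever fields you need from the detail page
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').get(),
        }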