I'm a newbie trying to scrape the href link of each place listed on this website, then follow each link and scrape its data. But I can't even get the href links from this code, even though the same XPath selector works fine in the Scrapy shell.
import scrapy
from scrapy_splash import SplashRequest


class TestspiSpider(scrapy.Spider):
    name = 'testspi'
    allowed_domains = ["powersearch.jll.com"]
    start_urls = ["https://powersearch.jll.com/us-en/property/search"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        properties = response.xpath('//*[@class="ssr__container"]').extract()
        print(properties)
        print("HELLO WORLD")
When I run the code, I get an empty list. Here's the output:
2020-09-03 19:58:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-03 19:58:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-03 19:58:49 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
2020-09-03 19:58:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://powersearch.jll.com/us-en/property/search via http://localhost:8050/render.html> (referer: None)
[]
HELLO WORLD
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-03 19:58:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 535,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 148739,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 9.802616,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 3, 14, 28, 59, 274213),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 51179520,
'memusage/startup': 51179520,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2020, 9, 3, 14, 28, 49, 471597)}
2020-09-03 19:58:59 [scrapy.core.engine] INFO: Spider closed (finished)
Please help me fix this.
In your case, I don't believe that Splash is required.
If you take a look at the webpage via your browser's developer tools, you will see that there is an API that is loading the properties.
You could have a standard scrapy spider call that API and request each page of properties: