Scrapy with Playwright: scraping immoweb


Configuration: working on WSL with an Ubuntu terminal, coding in Python with VS Code. Installed modules: scrapy, scrapy-playwright, playwright.

Project: extract data from www.immoweb.be (a Belgian real estate website). The site renders some components with JavaScript, hence the scrapy-playwright module.

Starting URL: search results for houses and apartments across Belgium.

Here's the code I'm running.

import scrapy
from scrapy_playwright.page import PageMethod

class ImmoSpider(scrapy.Spider):
    name = "immospider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page=1&orderBy=relevance",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "article.card.card--result.card--xl"),
                ],
            },
        )

    async def parse(self, response):
        properties = response.css('article.card.card--result.card--xl')

        # Untested loop. Goal: go through every page and scrape the data from every card.
        
        """
        for page_num in range(1, 10):
                url = f'https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page={page_num}&orderBy=relevance'
                yield Request(url=url, callback=self.parse, 
                                        meta={'page': page_num})
        """

        for card in properties:  # "card" avoids shadowing the built-in "property"
            url = card.css('h2 a::attr(href)').get()
            yield scrapy.Request(url,
                                 callback=self.parse_product,
                                 meta={
                                     "playwright": False
                                 }
                                 )

    async def parse_product(self, response):
        yield {
            'url' : response.url,
            'Price' : response.css('.classified__header-primary-info p.classified__price span.sr-only::text').get(),
            'Living Area' : response.css('#accordion_eeca443b-8b41-4284-b4af-5ab3f1622768 td.classified-table__data::text').get(),
            'Locality': response.css('span.classified__information--address-row::text').get(),
            # Placeholder selector, still to be written:
            'Type of property (House/apartment)': response.css('test').get(),
            }

The output is saved with the "scrapy crawl immospider -o results.csv" command.

Expected output: the data is scraped from every card of every search page and saved to a CSV file.

Actual output: URLs and prices are present for the 30 cards on the first search page, but the other fields (locality, etc.) are blank. There are no errors in the terminal.

I read the documentation, but I'm really new to this, and it feels like there are infinite ways of doing it; I'm a little overwhelmed.


1 Answer

Answered by Simeon Simeonov

There isn't any error because the missing data is rendered by JavaScript. Open a random listing and disable JavaScript in devtools: you will then see exactly what information is available to you/Scrapy without a browser. One way to access the rest without using Selenium is through the JSON object the page embeds:

import json
import re

data = re.search(r"window\.classified = (.*);", response.xpath('//div[@class="classified"]/script/text()').get()).group(1)

P.S. You need to clean the string first, because json.loads throws an error otherwise.

json.loads(data)["property"]["location"]["street"]

The result is 'Rue Jules Hans', testing against https://www.immoweb.be/en/classified/apartment/for-sale/braine-l%27alleud/1420/10572916. Just play around with the keys.
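Put together, the extraction looks like this. This is a minimal sketch against a hypothetical, much shorter window.classified payload than the real page embeds; the keys shown (property, location, street) are the ones mentioned above.

```python
import json
import re

# Hypothetical stand-in for the inline script immoweb embeds; the real
# window.classified object is far larger.
script_text = 'window.classified = {"property": {"location": {"street": "Rue Jules Hans", "postalCode": "1420"}}};'

# Capture everything between "window.classified = " and the final ";"
match = re.search(r"window\.classified = (.*);", script_text)
data = json.loads(match.group(1))

print(data["property"]["location"]["street"])  # Rue Jules Hans
```

In a spider you would pass `response.xpath('//div[@class="classified"]/script/text()').get()` to `re.search` instead of the sample string.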

You can also use the dict .get() method to avoid KeyErrors, and you can paste the //div[@class="classified"]/script/text() content into an online JSON linter to make it more readable. Don't forget to remove the leading window.classified = and the trailing ; first.
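The .get() advice matters because not every listing has every field. A short sketch (the "box" key is a made-up example of a field a listing might omit):

```python
# Chained dict.get() calls return None (or a chosen default) instead of
# raising KeyError when a listing omits a field.
classified = {"property": {"location": {"street": "Rue Jules Hans"}}}

street = classified.get("property", {}).get("location", {}).get("street")
box = classified.get("property", {}).get("location", {}).get("box")

print(street)  # Rue Jules Hans
print(box)     # None
```

Passing `{}` as the default at each intermediate level keeps the chain safe even when a whole nested section is missing.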