Scrapy with Playwright: scraping immoweb


Configuration: working on WSL with an Ubuntu terminal, coding in Python with VS Code. Installed modules: scrapy, scrapy-playwright, playwright.

Project: extract data from www.immoweb.be (a Belgian real estate website). The site renders some components with JavaScript, hence the scrapy-playwright module.

Starting URL: search results for houses and apartments across Belgium.

Here's the code I'm running.

import scrapy
from scrapy_playwright.page import PageMethod

class ImmoSpider(scrapy.Spider):
    name = "immospider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page=1&orderBy=relevance",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "article.card.card--result.card--xl"),
                ],
            },
        )

    async def parse(self, response):
        properties = response.css('article.card.card--result.card--xl')

        # Untested loop. Goal: go through every page and scrape the data from every card.
        
        """
        for page_num in range(1, 10):
                url = f'https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page={page_num}&orderBy=relevance'
                yield Request(url=url, callback=self.parse, 
                                        meta={'page': page_num})
        """

        for card in properties:  # "card" avoids shadowing the built-in "property"
            url = card.css('h2 a::attr(href)').get()
            yield scrapy.Request(url,
                                 callback=self.parse_product,
                                 meta={
                                     "playwright": False
                                 }
                                 )

    async def parse_product(self, response):
        yield {
            'url' : response.url,
            'Price' : response.css('.classified__header-primary-info p.classified__price span.sr-only::text').get(),
            'Living Area' : response.css('#accordion_eeca443b-8b41-4284-b4af-5ab3f1622768 td.classified-table__data::text').get(),
            'Locality': response.css('span.classified__information--address-row::text').get(),
            # Placeholder selector, still to be written:
            'Type of property (House/apartment)': response.css('test').get(),
            }

The output is saved with the "scrapy crawl immospider -o results.csv" command.

Expected output: the data is scraped from every card of every search page and saved to a CSV file.

Actual output: URLs and prices are present for the 30 cards on the first search page, but the other fields (locality, etc.) are blank. There are no errors in the terminal.

I read the documentation, but I'm really new to this, and it feels like there are infinite ways of doing it; I'm a little overwhelmed.


1 Answer

Answered by Simeon Simeonov

There isn't any error because the missing data is rendered by JavaScript. Open a random listing and disable JavaScript in devtools: you will then see exactly what information is available to you/Scrapy without a browser. One way to access the rest without using Selenium is through the JSON object the page embeds:

import json
import re

data = re.search(r"window\.classified = (.*);", response.xpath('//div[@class="classified"]/script/text()').get()).group(1)

P.S. You need to clean the string first, because json.loads throws an error otherwise.

json.loads(data)["property"]["location"]["street"]

The result is 'Rue Jules Hans', testing against https://www.immoweb.be/en/classified/apartment/for-sale/braine-l%27alleud/1420/10572916. Just play around with the keys.
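Put together, the extraction looks like this. This is a minimal sketch against a hypothetical, much shorter window.classified payload than the real page embeds; the keys shown (property, location, street) are the ones mentioned above.

```python
import json
import re

# Hypothetical stand-in for the inline script immoweb embeds; the real
# window.classified object is far larger.
script_text = 'window.classified = {"property": {"location": {"street": "Rue Jules Hans", "postalCode": "1420"}}};'

# Capture everything between "window.classified = " and the final ";"
match = re.search(r"window\.classified = (.*);", script_text)
data = json.loads(match.group(1))

print(data["property"]["location"]["street"])  # Rue Jules Hans
```

In a spider you would pass `response.xpath('//div[@class="classified"]/script/text()').get()` to `re.search` instead of the sample string.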

You can also use the dict .get() method to avoid KeyErrors, and you can paste the //div[@class="classified"]/script/text() content into an online JSON linter to make it more readable. Don't forget to remove the leading window.classified = and the trailing ; first.
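The .get() advice matters because not every listing has every field. A short sketch (the "box" key is a made-up example of a field a listing might omit):

```python
# Chained dict.get() calls return None (or a chosen default) instead of
# raising KeyError when a listing omits a field.
classified = {"property": {"location": {"street": "Rue Jules Hans"}}}

street = classified.get("property", {}).get("location", {}).get("street")
box = classified.get("property", {}).get("location", {}).get("box")

print(street)  # Rue Jules Hans
print(box)     # None
```

Passing `{}` as the default at each intermediate level keeps the chain safe even when a whole nested section is missing.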