Scrapy + Splash: not rendering full-page JavaScript data


I am just exploring Scrapy with Splash, and I am trying to scrape all the product (pants) data, with product ID, name, and price, from the e-commerce site Gap. However, not all of the dynamic product data is loaded when I view the page in the Splash web UI: only 16 items load on every request, and I have no clue why. I tried the following options, with no luck:

  • Increasing the wait time up to 20 seconds
  • Starting the Docker container with "--disable-private-mode"
  • Using a Lua script for page scrolling
  • Using the full-viewport option, splash:set_viewport_full()

lua_script2 = """
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    -- Scroll to the bottom repeatedly so lazily loaded items can appear
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
"""

# Inside the spider method that generates the requests:
yield SplashRequest(
    url,
    self.parse_product_contents,
    endpoint='execute',
    args={
        'lua_source': lua_script2,
        'wait': 5,
    },
)

Can anyone please shed some light on this behavior? P.S.: I am using the Scrapy framework, and I am able to parse the product information (item ID, name, and price) from render.html, but render.html contains only 16 items.
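A quick way to confirm how many items a given render actually contains is to count the product tiles in the returned HTML. The following is a minimal sketch using only the Python standard library. It assumes each product sits in an element carrying the class product-card, which is a hypothetical name used for illustration and not taken from the real site; the helper names are also made up here.

```python
from html.parser import HTMLParser


class ProductCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_products(html, class_name="product-card"):
    # "product-card" is a hypothetical class name; replace it with the
    # class the target site really uses for its product tiles.
    counter = ProductCounter(class_name)
    counter.feed(html)
    return counter.count


# Tiny stand-in document: two product tiles and one unrelated element
sample = (
    '<div class="product-card">A</div>'
    '<div class="product-card sale">B</div>'
    '<div class="banner">C</div>'
)
print(count_products(sample))
```

Running the same count on the HTML from each attempt (longer wait, scrolling, full viewport) would show whether any of the options changes the number of loaded items.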

There is 1 answer

Tarun Lalwani On BEST ANSWER

I updated the script as follows:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0
    -- Use a tall viewport so more of the page is laid out at once
    splash:set_viewport_size(1980, 8020)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    -- splash:set_viewport_full()
    splash:wait(10)
    -- Close the email subscribe popup, which otherwise blocks scrolling
    splash:runjs("jQuery('span.icon-x').click();")
    splash:wait(1)
    -- Scroll to the bottom repeatedly so lazily loaded items can appear
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end

    -- Give the last batch of products time to load
    splash:wait(30)

    return {
        png = splash:png(),
        html = splash:html(),
        har = splash:har()
    }
end

I ran it in my local Splash instance. The PNG output does not work properly, but the HTML does contain the last product on the page.

[Image: last image on the page]

[Image: Splash-rendered HTML]
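Because the script returns a table rather than a plain HTML string, the execute endpoint answers with JSON, and binary values such as the PNG arrive base64-encoded. A minimal sketch of unpacking such a body in Python follows; the function name and the stand-in payload are made up for illustration, and in a scrapy-splash callback the already decoded JSON should be available as response.data instead.

```python
import base64
import json


def unpack_splash_result(body):
    """Split the JSON body of an 'execute' response into html, png bytes, har."""
    data = json.loads(body)
    html = data.get("html", "")
    png_b64 = data.get("png")
    # The PNG is base64 text inside the JSON; decode it back to raw bytes
    png = base64.b64decode(png_b64) if png_b64 else None
    return html, png, data.get("har")


# Stand-in payload with the same keys the Lua script returns (not real data)
fake_body = json.dumps({
    "html": "<html><body>last product</body></html>",
    "png": base64.b64encode(b"\x89PNG fake bytes").decode("ascii"),
    "har": {"log": {"entries": []}},
})

html, png, har = unpack_splash_result(fake_body)
print(len(html), png[:4], len(har["log"]["entries"]))
```

Writing the decoded bytes to a .png file is a simple way to inspect what the screenshot actually captured.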

The only issue was that the page won't scroll while the email subscribe popup is showing, so I added code to close it.
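One thing to watch when calling this script from Scrapy: its fixed waits add up to more than Splash's default 30-second render timeout, so the request likely needs an explicit timeout argument, which in turn is capped by the --max-timeout value the Splash instance was started with. A rough sketch of the arithmetic and of the args dictionary follows; the helper names and the value 90 are assumptions chosen for illustration.

```python
def minimum_timeout(initial_wait, popup_wait, num_scrolls, scroll_delay, final_wait):
    """Sum of the fixed waits in the Lua script, in seconds."""
    return initial_wait + popup_wait + num_scrolls * scroll_delay + final_wait


def build_splash_args(lua_source, timeout):
    # 'lua_source' and 'timeout' are arguments of Splash's execute endpoint;
    # this dict is meant to be passed as SplashRequest(..., args=...).
    return {"lua_source": lua_source, "timeout": timeout}


# Waits taken from the script: 10 s, 1 s, 10 scrolls of 2.0 s, 30 s
waits = minimum_timeout(10, 1, 10, 2.0, 30)

lua_script = "-- paste the Lua script here"  # placeholder, not a real script
# 90 is an arbitrary value with some headroom above the summed waits
args = build_splash_args(lua_script, timeout=90)
print(waits, args["timeout"])
```

Page load time comes on top of the summed waits, so the timeout should leave some margin rather than match the sum exactly.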