Crawling dynamically loaded images from TripAdvisor

I am trying to scrape reviews from the TripAdvisor website. Because most of the images on the site are loaded dynamically, I use the Splash JavaScript rendering service to render the pages.

The problem is that some of the images are loaded and some are not.

Here is the URL of the review I want to crawl: https://www.tripadvisor.com.sg/ShowUserReviews-g294265-d1770798-r446535418-Marina_Bay_Sands-Singapore.html

I have tried setting the Splash wait time to 10 seconds (the maximum), and the result is still the same.

Here is the script I ran in Splash:

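function main(splash)
  local url = splash.args.url
  -- Load the page and fail loudly if navigation does not succeed.
  assert(splash:go(url))
  -- Give dynamically loaded content up to 10 seconds to appear.
  assert(splash:wait(10))
  -- Resize the viewport to the full page so the screenshot covers everything.
  splash:set_viewport_full()
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end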

And here is the image generated by Splash (with the footer section cropped out): Click to view the image

As you can see, all the other dynamically loaded images were loaded except the images in the review (they should be in the red rectangle). I checked the HTML and found that the img tags exist, but their src attribute is ".../x.gif", which is a one-pixel image rather than the URL of the real image.

Has anyone run into a problem like this, or does anyone have an idea why it happens?

There is 1 answer

Richard Dowinton

The images seem to be loaded when you scroll to them. However, when I tried using Splash to scroll to them, I was unable to get it to render the images despite setting a delay.

If you look at the response body, you'll notice that the images are contained in a JavaScript array named lazyImgs, and each image has an ID. You could read each ID from the placeholder elements when you traverse the reviews, and use them to retrieve the images from the JavaScript array. This is likely the simplest solution.
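The lookup described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the sample HTML is made up, the array is assumed to be valid JSON assigned with `var lazyImgs = [...];`, and the field name `data` for the real image URL is a guess that should be checked against the actual response body. Only the `lazyImgs` name and the presence of an ID per image come from the page itself; the helper names are hypothetical.

```python
import json
import re

# Hypothetical excerpt of a response body; the real page layout may differ.
SAMPLE_HTML = """
<div class="review">
  <img id="lazyload_101" src=".../x.gif">
  <img id="lazyload_102" src=".../x.gif">
</div>
<script>
var lazyImgs = [
  {"id": "lazyload_101", "data": "https://example.com/photo-a.jpg"},
  {"id": "lazyload_102", "data": "https://example.com/photo-b.jpg"}
];
</script>
"""


def parse_lazy_imgs(html):
    """Return {placeholder id: real image URL} read from the lazyImgs array.

    Assumes the array literal is valid JSON and that each entry stores its
    URL under "data" (an assumed field name).
    """
    match = re.search(r"var\s+lazyImgs\s*=\s*(\[.*?\])\s*;", html, re.DOTALL)
    if match is None:
        return {}
    entries = json.loads(match.group(1))
    return {e["id"]: e["data"] for e in entries if "id" in e and "data" in e}


def placeholder_ids(html):
    """Collect the id attribute of every img tag, in document order."""
    return re.findall(r'<img[^>]*\bid="([^"]+)"', html)


def resolve_review_images(html):
    """Map each placeholder img to its real URL, skipping unknown ids."""
    urls = parse_lazy_imgs(html)
    return [urls[i] for i in placeholder_ids(html) if i in urls]


if __name__ == "__main__":
    for url in resolve_review_images(SAMPLE_HTML):
        print(url)
```

Because this reads the URLs straight from the response body, it does not depend on Splash firing the scroll-triggered loader. In a real crawler an HTML parser would be a sturdier way to find the placeholder elements than a regular expression.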