Screen scraping images (i.e. Firefox Page Info / Google Images)


Preferably using Python (I'm currently using urllib and BeautifulSoup), given a URL.

For example I'm trying to scrape the main photo on this page: http://www.marcjacobs.com/marc-jacobs/womens/bags-and-accessories/c3122001/the-single#?p=1&s=12

In Firefox under Tools > Page Info > Media lists all the visible images, including the link to the image I want to grab ( http://imagesec.mj.ctscdn.com/image/336/504/6ace6aac-c049-4d7e-9465-c19b5cd8e4ac.jpg )

Two interrelated problems:

  1. If I view the page source, the image path reported by the Firefox tool is nowhere in the HTML document... Is there any way I can retrieve this path without going through Firefox Page Info, perhaps through Python and/or JavaScript/jQuery?
  2. I'm trying to get the photo of the product in "Orange", and I notice the page always loads the black color by default.

A working example is probably Google Shopping: if you type the name of this product and select a color, the image shows up in the correct color (from the exact same page) in the search results.

Basically, I want to be able to scrape color and style/variation specific images from mostly shopping sites.

Selecting the right color seems more complicated, so for now I'll settle for just the main product image in black.

So far I've tried selecting images based on their img height attributes, and reading the actual dimensions when there are no height/width attributes, but it occurred to me that there has to be a better way.
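For reference, what I have now looks roughly like this sketch (Python 3; the 300-pixel height cutoff is just an arbitrary guess at what counts as the "main" image):

# Rough sketch of my current approach: grab every <img> and keep the ones
# whose height attribute looks "big enough". The 300px cutoff is arbitrary.
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def candidate_images(url, min_height=300):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    for img in soup.find_all('img'):
        height = img.get('height', '')
        if height.isdigit() and int(height) >= min_height and img.get('src'):
            found.append(urljoin(url, img['src']))
    return found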

1 Answer

Answered by r_31415:

This can be a bit complex, but most of the solutions that work in this particular situation are pretty much the same.

First, let me tell you why using Beautiful Soup or lxml won't work. You need to retrieve some information which is available only after you click on the orange bag thumbnail, right? That content is loaded with JavaScript, so the orange bag image won't be available to Beautiful Soup and friends: they don't execute JavaScript, and they can't see elements that are absent from the parsed tree. So that is a dead end.
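If you want to verify this yourself, one quick sanity check (just a sketch, assuming the page can be fetched without special headers or cookies) is to download the raw HTML and look for the CDN host from the Page Info URL:

# Sanity check: does the static HTML even mention the CDN host that
# Firefox Page Info reported? (Assumes no special headers/cookies needed.)
from urllib.request import urlopen

url = 'http://www.marcjacobs.com/marc-jacobs/womens/bags-and-accessories/c3122001/the-single'
html = urlopen(url).read().decode('utf-8', errors='replace')
print('imagesec.mj.ctscdn.com' in html)  # most likely False: the image tag is injected by JavaScript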

However, there are other screen scraping tools like Selenium or PhantomJS. I have tested both and they work great. They basically drive a real browser, so they are obviously capable of handling JavaScript. I don't know whether you need to scrape this automatically from your server or you want to start the scraping process at will. With Selenium (after you tell it what page to open, what element to click, etc.), you will see your browser doing all that work by itself. There are other options available, such as running it against a headless browser. In my opinion it's very powerful, but it can get quite complex to set up.
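Just to give you an idea, a Selenium session (in Python) for your case could look roughly like the sketch below. The CSS selectors for the orange swatch and the product image are placeholders I made up; you would need to inspect the page to find the real ones:

# Rough Selenium sketch. 'a.swatch-orange' and 'img.product-image' are
# made-up selectors; inspect the page to find the real ones.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get('http://www.marcjacobs.com/marc-jacobs/womens/'
               'bags-and-accessories/c3122001/the-single')

    # Click the orange color swatch once it becomes clickable
    swatch = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.swatch-orange')))
    swatch.click()

    # Wait for the main product image, then read its src attribute
    image = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'img.product-image')))
    print(image.get_attribute('src'))
finally:
    driver.quit()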

A far easier solution is using PhantomJS. It's similar to Selenium, although, as its name indicates, you give the instructions via JavaScript (which can be a bit more comfortable since you're already dealing with web elements). I recommend using CasperJS on top of it: it eases the process of defining a full navigation scenario and provides useful high-level functions, methods and syntactic sugar for doing common tasks.

Let me give you a feel for what it looks like:

casperFunction = function(){
    var casper = require('casper').create({
        verbose: true
    });

    casper.start('yourwebpage'); // load the web page

    casper.then(function(){ // after loading...
        var value = casper.evaluate(function(){ // get me some element's value
            return document.getElementById('yourelement').value; // runs in the page context
        });
        casper.echo(value); // print it to the console
    });

    casper.then(function(){ // after that, click on this other element
        this.click('#id_of_other_element');
    });

    casper.wait(7000); // wait for some processing... this can be quite
                       // useful if you need to wait a few seconds in
                       // order to retrieve your orange bag later

    casper.run(); // actually runs the whole thing
};

There you have most of the things you need to accomplish your task.

By the way, let me remind you that you usually need to ask for permission before scraping that kind of content.

Hope that helps.