Getting specific images from a page


I am pretty new to BeautifulSoup. I am trying to print the image links from http://www.bing.com/images?q=owl:

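import urllib2
from bs4 import BeautifulSoup  # assumed; BeautifulSoup 3 uses "from BeautifulSoup import BeautifulSoup"

# Download the search results page (Python 2)
redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml)

# Each result tile is a div with class "dg_u"
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']  # works fine
    print div.find('img')['src']  # raises KeyError: 'src'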

But this gives only the title, not the image source. Is there anything wrong?

Edit: I have edited my source, but I still cannot get the image URL.


There are 2 answers

3
Vikas Ojha On BEST ANSWER

Bing is using some techniques to block automated scrapers. I tried printing

div.find('img')

and found that they send the image source in an attribute named src2, so the following should work:

div.find('img')['src2']

This is working for me. Hope it helps.
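To make the lookup tolerant of either attribute, a minimal sketch along these lines may help. It assumes Python 3 and the bs4 package. The sample markup, the example.com URLs, and the helper name extract_images are made up for illustration; the real Bing page may look different and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question.
SAMPLE_HTML = """
<div class="dg_u"><a t1="Barn owl" href="#"><img src2="http://example.com/owl1.jpg"></a></div>
<div class="dg_u"><a t1="Snowy owl" href="#"><img src="http://example.com/owl2.jpg"></a></div>
<div class="dg_u"><a t1="No image" href="#"></a></div>
"""


def extract_images(html):
    """Return (title, url) pairs, preferring src2 and falling back to src."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", attrs={"class": "dg_u"}):
        link = div.find("a")
        img = div.find("img")
        if link is None or img is None:
            continue  # skip tiles without a link or an image
        # .get() returns None for a missing attribute instead of raising KeyError
        url = img.get("src2") or img.get("src")
        if url:
            results.append((link.get("t1"), url))
    return results


if __name__ == "__main__":
    for title, url in extract_images(SAMPLE_HTML):
        print(title, url)
```

Using .get() rather than indexing means a tile with neither attribute is skipped instead of stopping the loop with a KeyError.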

1
alecxe On

If you open your browser's developer tools, you'll see that an additional asynchronous XHR request is issued to the http://www.bing.com/images/async endpoint, and its response contains the image search results.

This leads to the three main options you have:

  • simulate that XHR request in your code. You might want to use something more suitable for humans than urllib2; see the requests module. This is the so-called "low-level" approach: it goes down to the bare metal and relies on site-specific implementation details, which makes this option unreliable, difficult, "heavy", error-prone and fragile

  • automate a real browser using selenium and stay at the high level. In other words, you don't care how the results are retrieved, what requests are made, or what JavaScript needs to be executed. You just wait for the search results to appear and extract them.

  • use the Bing Search API (this should probably be option #1)
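As a rough illustration of the first option, the sketch below requests the async endpoint with the requests module and pulls image URLs out of the response. It assumes Python 3 with requests and bs4 installed. The "q" query parameter, the User-Agent header, and the function names are assumptions rather than anything documented by Bing; check the real request in the developer tools before relying on it.

```python
from bs4 import BeautifulSoup

ASYNC_URL = "http://www.bing.com/images/async"  # endpoint mentioned in the answer


def fetch_results_html(query, timeout=10):
    """Fetch the raw HTML fragment returned by the async endpoint."""
    # Imported here so the parsing helper below works without requests installed.
    import requests

    response = requests.get(
        ASYNC_URL,
        params={"q": query},  # assumed parameter name, not confirmed
        headers={"User-Agent": "Mozilla/5.0"},  # assumed; some sites reject the default agent
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def image_urls(html):
    """Collect image URLs from an HTML fragment, checking src2 before src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        url = img.get("src2") or img.get("src")
        if url:
            urls.append(url)
    return urls
```

Calling image_urls(fetch_results_html("owl")) would then return whatever image URLs the response contains, if the endpoint accepts the request at all. Because this depends on undocumented behaviour, it can break without notice, which is the fragility described in the first option.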