How can I scrape the correct number of URLs from an infinite-scroll webpage?


I am trying to scrape URLs from a webpage. I am using this code:

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("http://www.barneys.com/barneys-new-york/men/clothing/shirts/dress/classic#sz=176&pageviewchange=true")
content = url.read()
soup = BeautifulSoup(content)

links = soup.find_all("a", {"class": "thumb-link"})

for link in links:
    print(link.get('href'))

But the output contains only 48 links instead of 176. What am I doing wrong?


1 answer

Answered by heinst:

The initial HTML only contains the first batch of products; the rest are filled in by AJAX calls as you scroll, which is why your code never sees all 176 links. So what I did was use Postman's interceptor feature to look at the call the website made each time it loaded the next set of 36 shirts, and then replicate those calls in code. You can't dump all 176 items at once, so I replicated the 36-at-a-time requests the website makes.

from bs4 import BeautifulSoup
import requests

urls = []

# 36 * 5 = 180 >= 176, so five requests cover the whole category
# (stopping at range(1, 5) would request at most 144 items).
for i in range(1, 6):
    offset = 36 * i
    r = requests.get('http://www.barneys.com/barneys-new-york/men/clothing/shirts/dress/classic?start=1&format=page-element&sz={}&_=1434647715868'.format(offset))
    soup = BeautifulSoup(r.text, 'html.parser')

    links = soup.find_all("a", {"class": "thumb-link"})

    for link in links:
        href = link.get('href')
        # Responses can repeat items from earlier requests, so skip
        # anything already collected and stop once all 176 are in hand.
        if href not in urls and len(urls) < 176:
            print(href)
            urls.append(href)
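
More generally, the same replay-the-XHR idea can be written without hard-coding the number of requests. Below is a minimal sketch that stops once a response yields nothing new, assuming the endpoint keeps honoring a growing sz parameter the way the intercepted calls above suggest; the parameter names come from the observed URL, not from any documented API, and the cache-busting _ timestamp is omitted:

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.barneys.com/barneys-new-york/men/clothing/shirts/dress/classic'
PAGE_SIZE = 36  # batch size observed in the intercepted calls

urls = []
page = 1
while True:
    # Mirror the site's own scroll requests: ask for the first
    # page * PAGE_SIZE items as an HTML fragment.
    r = requests.get(BASE_URL, params={
        'start': 1,
        'format': 'page-element',
        'sz': page * PAGE_SIZE,
    })
    soup = BeautifulSoup(r.text, 'html.parser')
    hrefs = [a.get('href') for a in soup.find_all('a', {'class': 'thumb-link'})]

    added = 0
    for h in hrefs:
        if h and h not in urls:
            urls.append(h)
            added += 1
    if added == 0:  # no unseen links means we have reached the end
        break
    page += 1

print(len(urls))  # 176 for this category at the time of the question

Stopping when a response yields nothing new avoids baking the 176 total into the loop, so the same code keeps working if the category grows or shrinks.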