How can I scrape a site with multiple pages using beautifulsoup and python?


I am trying to scrape a website. This is a continuation of this question: "soup.findAll is not working for table".

I was able to obtain the data I need, but the site has multiple pages, and the number of pages varies by day: it can be 20 pages one day and 33 on another. I tried to implement the solution from "How to scrape the next pages in python using Beautifulsoup", which relies on finding the last page element, but when I got to the pager div on the site I want to scrape, I found this format:

   <a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
   <a class="ctl00_cph1_mnuPager_1">33</a>

How can I scrape all the pages on the site, given that the number of pages changes daily? By the way, the page URL does not change when the page changes.

4

There are 4 answers

0
amarynets On
  1. BS4 on its own will not solve this, because it can't run JavaScript.
  2. First, you can try Scrapy together with this answer.
  3. You can also use Selenium for it.
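
BS4 cannot execute the `__doPostBack` call, but on ASP.NET WebForms pages the same request can often be replayed as a plain form POST, so it may be worth trying before reaching for a browser. The following is a minimal sketch of that idea, not a tested solution for this particular site. It assumes that the pager always contains a link to the next page number, that the postback argument identifies the page (as in the `'32'` example from the question), and that the server accepts the page's hidden fields sent back unchanged. The helper names `find_postback`, `build_form` and `iter_pages` are made up for illustration.

```python
import re

import requests
from bs4 import BeautifulSoup

# Matches href="javascript:__doPostBack('target','argument')"
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def find_postback(soup, page_number):
    """Return (target, argument) for the pager link labelled page_number, or None."""
    for link in soup.find_all("a", href=POSTBACK_RE):
        if link.get_text(strip=True) == str(page_number):
            return POSTBACK_RE.search(link["href"]).groups()
    return None


def build_form(soup, target, argument):
    """Copy the page's hidden form fields and add the postback target/argument."""
    form = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", attrs={"type": "hidden"})
        if field.get("name")
    }
    form["__EVENTTARGET"] = target
    form["__EVENTARGUMENT"] = argument
    return form


def iter_pages(url):
    """Yield one BeautifulSoup object per page until no next-page link exists."""
    session = requests.Session()
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    page = 1
    while True:
        yield soup
        page += 1
        postback = find_postback(soup, page)
        if postback is None:
            break  # the pager has no link to the next page, so this was the last
        response = session.post(url, data=build_form(soup, *postback))
        soup = BeautifulSoup(response.text, "html.parser")
```

Because the loop stops as soon as the pager has no link for the next number, it does not need to know in advance whether there are 20 or 33 pages that day. If the site rejects the replayed POST, a browser-driven approach is the fallback.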
0
django11 On

I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.

You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.

I use it even when I'm doing the scraping itself in BS4, because it lets me monitor the progress of a scraping project more easily.

0
danielfrg On

As some people have mentioned, you might want to look at Selenium. I wrote a blog post about doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/

Things are much better now with headless Chrome and Firefox.

1
Inferis On

Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Inspect the pages and see if there is an element that exists on the pages with content but doesn't exist on a page past the end.

In my for loop I used:

    import requests
    from bs4 import BeautifulSoup

    base_url = 'url here'  # the page number goes at the end of my url

    # 5000 is just a large number that what I was searching for wouldn't reach
    for n in range(1, 5000):
        url = base_url + str(n)
        # fetching with requests is an assumption; use whatever you already use
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')

        # this is the element that didn't exist after the pages with content finished
        figure = soup.find_all('figure')
        if not figure:
            # no content left on this page: break out of the page iterations
            # and jump to my other listing in another url
            break

I hope this helps some, or helps cover what you needed.