I am trying to write a basic web scraper that looks through a forum, goes into each post, and checks whether the post contains any GitHub links, storing those links. I am doing this as part of my research into how people use and implement Smart Device routines.

I'm fairly new to web scraping, and have been using BeautifulSoup, but I've run into a strange issue. First, my program:

from bs4 import BeautifulSoup
import requests
from user_agent import generate_user_agent

url = 'https://community.smartthings.com/c/projects-stories'

headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('linux'))}
page_response = requests.get(url, timeout=5, headers=headers)

page = requests.get(url, timeout = 5)
#print(page.content)
if page.status_code == 200:
    print('URL: ', url, '\nRequest Successful!')
content = BeautifulSoup(page.content, 'html.parser')
print(content.prettify())

project_url = []
for i in content:
    project_url += content.find_all("/div", class_="a href")
print(project_url)

What I'm trying to do right now is simply collect the URL of each individual post on the website. When I run this, it returns an empty list. After some experimentation in trying to pick out a specific URL based on its ID, I found that while the ID of each post does not seem to change when the page is reloaded, it DOES change if the website detects that a scraper is being used. I believe this because when the contents of the webpage are printed to the console, there is a section at the end of the HTML that reads:

  <!-- include_crawler_content? -->
  </div>
  <footer class="container">
   <nav class="crawler-nav" itemscope="" itemtype="http://schema.org/SiteNavigationElement">
    <a href="/">
     Home
    </a>
    <a href="/categories">
     Categories
    </a>
    <a href="/guidelines">
     FAQ/Guidelines
    </a>
    <a href="/tos">
     Terms of Service
    </a>
    <a href="/privacy">
     Privacy Policy
    </a>
   </nav>

The website seems to detect the crawler and change the navigation accordingly. I've tried generating a new User-Agent string to get around it, but I've had no luck.
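For reference, here is a minimal sketch of the link-collection step on its own, run against part of the crawler footer quoted above rather than the live site. It assumes the links of interest are plain `<a>` elements with an `href` attribute. To stay free of extra packages it swaps BeautifulSoup for the standard library's `html.parser`; the names `LinkCollector` and `collect_links` are made up for illustration. With BeautifulSoup, the equivalent lookup would be roughly `content.find_all("a", href=True)`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://community.smartthings.com"

# Sample markup taken from the crawler footer quoted in the question.
SAMPLE_HTML = """
<nav class="crawler-nav">
 <a href="/">Home</a>
 <a href="/categories">Categories</a>
 <a href="/guidelines">FAQ/Guidelines</a>
</nav>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative paths such as "/categories" to absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url=BASE_URL):
    """Return all anchor hrefs found in html, as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


links = collect_links(SAMPLE_HTML)
print(links)
```

The point of the sketch is only that the tag name is `a` and the URL lives in its `href` attribute; whether the post links appear in the HTML the server returns to a script is a separate question that this does not settle.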

Any ideas?
