I am trying to scrape 'https://www.kaggle.com/kernels' to return all of the titles on the page, but the container that holds them, 'div data-reactroot', does not show up in the scraped data.
import urllib.request  # "import urllib" alone does not load the request submodule
from bs4 import BeautifulSoup
kaggle = 'https://www.kaggle.com/kernels'
data = urllib.request.urlopen(kaggle).read()
htmlparse = BeautifulSoup(data, 'html.parser')
# find_all is the current spelling of findAll
print(htmlparse.find_all("div", {"class": "block-link block-link--bordered"}))
Is there an error in my code or is there some sort of block on the site preventing me from scraping this data?
The data you want is fetched by JavaScript in JSON format each time you request the page, so it is not part of the HTML that urllib downloads and BeautifulSoup never sees it. You can fetch it from "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all" like this.
Outputs:
The only thing you will have to change is the "after" query string parameter, which was 439354 in my request; you could set it to 0 to get the first records.
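Rather than editing the URL by hand, the query string can be assembled from a dictionary. This is only a sketch: the helper name `build_kernels_url` is made up for illustration, and the parameter names and default values are copied from the URL above rather than from any documented API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.kaggle.com/kernels.json"


def build_kernels_url(after=0, page_size=20):
    """Build the listing URL (hypothetical helper, not part of any library).

    The parameter names mirror the query string in the URL quoted above.
    """
    params = {
        "sortBy": "hotness",
        "group": "everyone",
        "pageSize": page_size,
        "after": after,
        "language": "all",
        "outputType": "all",
    }
    # urlencode converts the values to strings and joins them with "&"
    return f"{BASE_URL}?{urlencode(params)}"


print(build_kernels_url(after=0))
```

Passing `after=439354` reproduces the URL from my request.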
You could also change the number of records returned by changing the "pageSize" query string parameter, e.g. "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"
Outputs:
Or an example with urllib:
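A minimal sketch using only the standard library, assuming the endpoint returns a JSON list of records that each carry a "title" field. The field name and both helper functions are assumptions for illustration, so check them against the actual response before relying on them.

```python
import json
import urllib.request

URL = (
    "https://www.kaggle.com/kernels.json"
    "?sortBy=hotness&group=everyone&pageSize=5&after=0"
    "&language=all&outputType=all"
)


def fetch_kernels(url, opener=urllib.request.urlopen):
    """Download the URL and decode the JSON body.

    `opener` can be swapped out so the function can be exercised offline.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(kernels):
    """Collect the "title" of each record (assumed field name)."""
    return [kernel["title"] for kernel in kernels if "title" in kernel]
```

Calling `extract_titles(fetch_kernels(URL))` should then give the list of titles the question asks for, provided the response has the assumed shape.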