Web scrape not returning full HTML


I am trying to scrape 'https://www.kaggle.com/kernels' to get all of the titles listed on the page, but the container that holds them, 'div data-reactroot', is missing from the scraped data.

import urllib.request  # import the submodule explicitly; urlopen lives in urllib.request
from bs4 import BeautifulSoup

kaggle = 'https://www.kaggle.com/kernels'
data = urllib.request.urlopen(kaggle).read()
htmlparse = BeautifulSoup(data, 'html.parser')
print(htmlparse.find_all("div", {"class": "block-link block-link--bordered"}))

Is there an error in my code, or is something on the site blocking me from scraping this data?

There are 2 answers

Dan-Dev (best answer)

The data you want is fetched by JavaScript in JSON format each time the page is requested, so it is not part of the HTML that urllib downloads. You can fetch it directly from "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all" like this:

import requests

# Request the JSON endpoint directly instead of the HTML page
source = requests.get("https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all")
json_obj = source.json()  # decode the JSON response body
for a in json_obj:
    print(a["title"])

Outputs:

2004-2005 Landfalling Hurricanes animation
Visualization of StockData
Generating Sentences One Letter at a Time 
Decoding the Sexiest Job of 21st Century!!
Novice to Grandmaster
Analysis  on Pokemon Data
ROC Curve with k-Fold CV
Japan Bulgaria trade playground
Bootstrapping and CIs with Veteran Suicides
Replicating "Did I do that?" paper analyses with R
Social Progress Index and World Happiness Report
SVM+HOG On ColourCompositeImage
Low- level students
PyTorch Speech Recognition Challenge (WIP)  
Loans -getting Insights
Exploring Youtube Trending Statistics EDA
3 Simple Steps (LB: .9878 with new data)
Titanic: Neural Network using Keras
Feature Engineering 
Why do employees leave and what to do about it

The only thing you need to change is the "after" query string parameter. In my request it was 439354, but you can set it to 0 to get the first records.

You can also change the number of records returned with the "pageSize" query string parameter, e.g. "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"

Outputs:

Data ScienceTutorial for Beginners
Data visualization and investigation
Spooky NLP and Topic Modelling tutorial
20 Years Of Games Analysis
NYC Taxi EDA - Update: The fast & the curious

Or an example with urllib:

import urllib.request
import json
kaggle = "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"
data = urllib.request.urlopen(kaggle).read()
json_obj = json.loads(data.decode("utf-8"))
for a in json_obj:
    print (a["title"])
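
If you want to vary "after" and "pageSize" without editing the URL by hand, the query string can be built with urllib.parse.urlencode. This is only a sketch: the helper name build_kernels_url and its defaults are my own, and the parameter names are simply the ones from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in this answer
BASE_URL = "https://www.kaggle.com/kernels.json"


def build_kernels_url(after=0, page_size=20):
    """Return the listing URL for one page of results (hypothetical helper)."""
    params = {
        "sortBy": "hotness",
        "group": "everyone",
        "pageSize": page_size,  # number of records to return
        "after": after,         # 0 starts from the first records
        "language": "all",
        "outputType": "all",
    }
    # urlencode keeps the dict's insertion order and converts ints to strings
    return BASE_URL + "?" + urlencode(params)


print(build_kernels_url(after=0, page_size=5))
```

The resulting string can be passed to requests.get or urllib.request.urlopen exactly as in the examples above.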

Bohdan Kaminskyi

As Elis Byberi wrote, the problem is that you are trying to read the data before the page has rendered it. You can get the page content after rendering by using PhantomJS, a headless browser. You can find a small tutorial here
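
A quick way to confirm that the elements are missing from the downloaded HTML, rather than being missed by your selector, is to count them in the raw markup. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins for the page before and after rendering, not Kaggle's actual markup, and ClassCounter and count_elements are names I chose for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type whose class attribute has all wanted classes."""

    def __init__(self, tag, class_names):
        super().__init__()
        self.tag = tag
        self.wanted = set(class_names.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The attribute value can be None, so fall back to an empty string
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_elements(html, tag, class_names):
    parser = ClassCounter(tag, class_names)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins for the page before and after JavaScript has run
raw_html = '<html><body><div id="site-content"></div><script src="app.js"></script></body></html>'
rendered_html = (
    '<html><body><div data-reactroot="">'
    '<div class="block-link block-link--bordered">A</div>'
    '<div class="block-link block-link--bordered">B</div>'
    '</div></body></html>'
)

target = "block-link block-link--bordered"
print(count_elements(raw_html, "div", target))       # 0
print(count_elements(rendered_html, "div", target))  # 2
```

If the count is 0 for the HTML you downloaded but the elements show up in the browser's inspector, the content is being added after the page loads.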