How to webscrape data from a webpage with dynamic HTML (Python)?


I'm trying to figure out how to scrape the data from the following url: https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx

Here is the type of data:

[screenshot of the NICU search results table omitted]

It appears that everything is populated from a database and loaded into the page via JavaScript.

I've done something similar in the past using Selenium and PhantomJS, but I can't figure out how to extract these data fields in Python.

As expected, I can't use pd.read_html for this type of problem.

Is it possible to parse the results from:

from selenium import webdriver

url="https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"

browser = webdriver.PhantomJS()
browser.get(url)
content = browser.page_source

Or maybe to access the actual underlying data?

If not, what are other approaches short of copy and pasting for hours?

EDIT:

Building on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to go across all of the pages (and parse it properly)? My end goal is to have this in a pandas DataFrame.

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.post(
    url='https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},
)  # key: value
html = r.text

# Parse the HTML fragment that follows the last </script> tag
soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
div = soup.findAll("div", {"class": "blue-border panel list-group"})

def f(x):
    # Strip whitespace, drop empty lines and UI labels
    ignore_fields = ['Collapse all', 'Expand all']
    output = list(filter(bool, map(str.strip, x.text.split("\n"))))
    output = list(filter(lambda s: s not in ignore_fields, output))
    return output

results = pd.Series(list(map(f, div))[0])
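To cover more than page 1, the usual approach is to resend the same POST request with a page-index field and stop when a page comes back empty. A minimal sketch, assuming the API accepts a page parameter: the field name `SearchCriteria.Page` below is a guess and must be confirmed by watching the real XHR payload in your browser's Network tab.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://search.aap.org/nicu/"

def build_payload(page):
    # NOTE: "SearchCriteria.Page" is a hypothetical field name --
    # confirm the real one in the DevTools Network tab.
    return {
        "SearchCriteria.Level": "1",
        "SearchCriteria.Page": str(page),
        "X-Requested-With": "XMLHttpRequest",
    }

def scrape_all_pages(max_pages=50):
    rows = []
    for page in range(1, max_pages + 1):
        r = requests.post(URL, data=build_payload(page))
        soup = BeautifulSoup(r.text.split("</script>")[-1].strip(),
                             "html.parser")
        names = [s.get_text(strip=True) for s in soup.find_all("strong")]
        if not names:  # no results -> past the last page
            break
        rows.extend(names)
    return pd.DataFrame({"name": rows})
```

If the site paginates with a different mechanism (e.g. an offset or a "load more" token), the loop stays the same and only `build_payload` changes.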

1 Answer

Answered by thenullptr (accepted, 6 votes):

To follow on from my last comment, the below should give you a good starting point. When looking through the XHR calls, check what data is sent to and received from each one to pinpoint the request you need. Below is the raw POST data sent to the API when performing a search; it looks like you need at least one of the search fields, and you must always include the last one (`X-Requested-With`).

{
    "SearchCriteria.Name": "smith",
    "SearchCriteria.City": "",
    "SearchCriteria.State": "",
    "SearchCriteria.Zip": "",
    "SearchCriteria.Level": "",
    "SearchCriteria.LevelAssigner": "",
    "SearchCriteria.BedNumberRange": "",
    "X-Requested-With": "XMLHttpRequest"
}

Here is a simple example of how you can send a POST request using the requests library. The page replies with a raw HTML fragment, so you can use BeautifulSoup or similar to parse out the information you need.

import requests

r = requests.post('https://search.aap.org/nicu/',
                  data={'SearchCriteria.Name': 'smith',
                        'X-Requested-With': 'XMLHttpRequest'})  # key: value
print(r.text)

which prints `<strong class="col-md-8 white-text">JOHN PETER SMITH HOSPITAL</strong>...`

https://requests.readthedocs.io/en/master/user/quickstart/
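Since the end goal is a pandas DataFrame, the returned fragment can be parsed without any further requests. A minimal sketch on a hard-coded sample: the structure below is assumed from the `<strong class="col-md-8 white-text">` snippet above, so verify the tag and class names against a real response.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample of the HTML fragment the API returns (shape assumed from the
# snippet in the answer; verify against a real response).
sample = """
<div class="blue-border panel list-group">
  <strong class="col-md-8 white-text">JOHN PETER SMITH HOSPITAL</strong>
  <strong class="col-md-8 white-text">SMITH MEMORIAL NICU</strong>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
# class_ matches any one of an element's classes, so "white-text"
# matches 'class="col-md-8 white-text"'
names = [s.get_text(strip=True)
         for s in soup.find_all("strong", class_="white-text")]
df = pd.DataFrame({"hospital": names})
print(df)
```

Swapping `sample` for `r.text` (after stripping the leading `</script>` blocks as in the question's EDIT) applies the same parsing to live responses.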