I'm trying to figure out how to scrape the data from the following URL: https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx
Here is the type of data:
It appears that everything is populated from a database and loaded into the webpage via JavaScript. I've done something similar in the past using Selenium and PhantomJS, but I can't figure out how to get these data fields in Python. As expected, I can't use pd.read_html for this type of problem.
Is it possible to parse the results from:
from selenium import webdriver

url = "https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"
browser = webdriver.PhantomJS()   # headless browser, so the page's JavaScript actually runs
browser.get(url)
content = browser.page_source     # HTML after the page has rendered
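For example, something along these lines is what I have in mind: hand the rendered page_source to BeautifulSoup and pull the text out of each result block (the class name below is a guess based on the rendered markup, so it may need adjusting):

from bs4 import BeautifulSoup

# Parse the JavaScript-rendered HTML that PhantomJS produced above.
soup = BeautifulSoup(content, "html.parser")
# "blue-border panel list-group" appears to be the class used by the result blocks.
for block in soup.find_all("div", {"class": "blue-border panel list-group"}):
    print(block.get_text(separator="\n", strip=True))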
Or, alternatively, is there a way to access the actual underlying data? If not, what other approaches are there, short of copying and pasting for hours?
EDIT:
Building on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to go through all of the pages, and are there any recommendations for parsing it properly? My end goal is to have this in a pandas DataFrame (roughly as sketched after the code below).
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.post(
    url='https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1'},              # search form field
    headers={'X-Requested-With': 'XMLHttpRequest'},  # this belongs in the headers, not the form data
)
html = r.text

# Parse only the markup after the last </script> tag
soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
divs = soup.findAll("div", {"class": "blue-border panel list-group"})

def f(x):
    ignore_fields = ['Collapse all', 'Expand all']
    output = list(filter(bool, map(str.strip, x.text.split("\n"))))
    output = list(filter(lambda field: field not in ignore_fields, output))
    return output

results = pd.Series(list(map(f, divs))[0])
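For the DataFrame end goal, something along these lines (purely illustrative; it just stacks one parsed field list per result block, so the columns are positional rather than named) could follow on from the code above:

# Illustrative sketch: one row per result block, built from the parsed field lists.
# Column labels are positional here; real field names would need to be mapped by hand.
records = list(map(f, divs))
df = pd.DataFrame(records)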
To follow on from my last comment, the below should give you a good starting point. When looking through the XHR calls, you just want to see what data is being sent and received by each one to pinpoint the one you need. The below is the raw POST data being sent to the API when doing a search; it looks like you need to use at least one of the search fields and include the last one.
Here is a simple example of how you can send a POST request using the requests library; the web page will reply with the raw data, so you can use BS or similar to parse it and get the information you need.
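Something like the following (reusing the endpoint and the SearchCriteria.Level field shown in the EDIT above, with X-Requested-With sent as a header):

import requests

# Endpoint and form field mirror the ones used in the question's EDIT;
# X-Requested-With marks the request as an AJAX-style call.
url = 'https://search.aap.org/nicu/'
data = {'SearchCriteria.Level': '1'}
headers = {'X-Requested-With': 'XMLHttpRequest'}

r = requests.post(url, data=data, headers=headers)
print(r.text)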
which prints output like:
<strong class="col-md-8 white-text">JOHN PETER SMITH HOSPITAL</strong>...
https://requests.readthedocs.io/en/master/user/quickstart/