Use rvest to scrape all p after h? (or other R package)


I am new to the world of HTML scraping and am having difficulty pulling in the paragraphs under particular headings, using rvest in R.

I want to scrape info from multiple pages that all have a relatively similar setup. They all have the same headings, but the number of paragraphs under a heading can vary. I was able to scrape a specific paragraph under a heading with the following code:

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

html <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))
assessment <- html[3] %>%
              html() %>%
              html_nodes(xpath='//*[@id="main"]/div/div/p[3]') %>%
              html_text()

The 'xpath' argument pulls in the first paragraph under the assessment heading. Some of the pages have multiple paragraphs under the assessment heading, which I can get if I change the 'xpath' argument to name them explicitly, e.g. p[4] or p[5]. Unfortunately I want to iterate this process over hundreds of pages, so changing the xpath each time isn't practical, and I don't know in advance how many paragraphs each page will have.

I think that pulling all <p> after the heading I am interested in is the best option, considering the uncertainty around the set-up of the pages.

Is there a way to scrape all <p> after <h3>Assessment</h3> using rvest or some other R scraping package?

Best answer, by hrbrmstr

I expanded this out only for demo purposes. You should be able to apply it to your original code. It's really not a good idea to overwrite names in namespaces you end up using (your code assigns to html and then calls html()). Also note that I'm using the latest (github/devtools) version of rvest, which uses xml2 and deprecates html() in favour of read_html().

The key is xpath="//h3[contains(., 'Assessment')]/following-sibling::p", thusly:

library(rvest)

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

sites <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))

pg <- read_html(sites[1])
pg_2 <- read_html(sites[2])
pg_3 <- read_html(sites[3])

pg %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (2)}
## [1] <p>This unit is assessed on a pass/fail basis. Multiple-choice on-line test   ...
## [2] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_2 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (3)}
## [1] <p>Mid-trimester test 20%, three assignments (3 x 10%) 30%, examination 50%.</p>
## [2] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [3] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_3 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (6)}
## [1] <p>Assessment 1 (Group of 3 students) - Student video presentation (5-7 mins) ...
## [2] <p>Assessment 2 (Group of 3 students) - Business plan (3500-4000 words) - 30% ...
## [3] <p>Examination (2 hours) - 60%</p>
## [4] <p><a href="http://www.deakin.edu.au/glossary?result_1890_result_page=H" targ ...
## [5] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [6] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

You can probably use that <p style="margin-top: 2em;"> as a marker to stop, too. You should check out xml2's as_list to help.
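
As a rough illustration of the stopping idea, here is a minimal sketch that folds two extra conditions into the XPath itself: keep only paragraphs whose nearest preceding <h3> is the Assessment heading, and drop the styled marker paragraph. The inline HTML is a made-up stand-in for a unit page (the real pages may be laid out differently), and assessment_xpath and get_assessment are hypothetical names introduced only for this example.

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Hypothetical stand-in for one unit page; NOT the real Deakin markup.
page <- read_html('
<div id="main">
  <h3>Content</h3>
  <p>Unit overview.</p>
  <h3>Assessment</h3>
  <p>Mid-trimester test 20%.</p>
  <p>Examination 50%.</p>
  <p style="margin-top: 2em;"><a href="/course.php">Return to course</a></p>
  <h3>Fees</h3>
  <p>Fee details.</p>
</div>')

# Start from the XPath in the answer, then narrow it with two predicates:
#  1. the closest preceding <h3> sibling must be the Assessment heading,
#     so paragraphs under any later heading are excluded;
#  2. the paragraph must not carry the "margin-top" marker style.
assessment_xpath <- paste0(
  "//h3[contains(., 'Assessment')]/following-sibling::p",
  "[preceding-sibling::h3[1][contains(., 'Assessment')]]",
  "[not(contains(@style, 'margin-top'))]"
)

# Hypothetical helper: return the assessment paragraphs of a parsed page
# as a character vector.
get_assessment <- function(page) {
  page %>%
    html_nodes(xpath = assessment_xpath) %>%
    html_text()
}

get_assessment(page)

# Assuming `sites` is built as in the answer above, the same helper could
# be applied to every page, e.g.:
# results <- lapply(sites, function(u) get_assessment(read_html(u)))
```

On the stand-in page this keeps only the two paragraphs that sit between the Assessment heading and the styled marker. Whether the real pages put later sections under their own <h3> is an assumption, so it is worth checking the result on a few units before running it over hundreds of pages.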