Parsing non-nested XML tags in R


I am trying to parse a number of documents using the excellent xml2 R library. As an example, consider the following XML file:

pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

It contains a number of <speech> tags that are separated by, though not nested within, a number of <minor-heading> and <major-heading> tags. I would like to process this document into a data.frame with the following structure:

     major_heading_id  speech_text
     heading_id_1       text1
     heading_id_1       text2
     heading_id_2       text3
     heading_id_2       text4

Unfortunately, because the tags are not nested, I cannot figure out how to do this! I have code that successfully recovers the relevant information (see below), but matching the speech tags to their respective major-headings is beyond me.

My intuition is that it would probably be best to split the XML document at the heading tags, and then process each as an individual document, but I couldn't find a function in the xml2 package that would let me do this!

Any help would be great.

Where I have got to so far:

speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))

heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")

There is 1 answer

Rentrop (best answer)

You can do this as follows:

library(xml2)
library(tidyverse)  # provides map(), map2_df(), tibble() and %>%
doc <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

# Get the headings
heading_recs <- xml_find_all(doc, "//major-heading")

# Build one XPath expression per heading: expression n selects the speech
# nodes that have exactly n major-heading siblings before them.
path <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]", 
                seq_along(heading_recs))

# Get the text of the speech nodes
map(path, ~xml_text(xml_find_all(doc, .x))) %>% 
# Combine it with the id of the headings
  map2_df(xml_attr(heading_recs, "id"), 
          ~tibble(major_heading_id = .y, speech_text = .x))

This results in a tibble with one row per speech and two columns, major_heading_id and speech_text.
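To see why counting preceding siblings groups the speeches correctly, here is a minimal sketch on a small inline document. The toy XML, its ids and its root element are made up for illustration and only assume that headings and speeches are siblings, as in the real file. It uses base R instead of purrr to keep the example dependency-light; the XPath expression is the same one as above.

```r
library(xml2)

# Hypothetical miniature document: headings and speeches are siblings,
# not nested inside each other.
toy <- read_xml(paste0(
  "<publicwhip>",
  "<major-heading id='h1'>First debate</major-heading>",
  "<speech id='s1'>text1</speech>",
  "<speech id='s2'>text2</speech>",
  "<major-heading id='h2'>Second debate</major-heading>",
  "<minor-heading id='m1'>A sub-topic</minor-heading>",
  "<speech id='s3'>text3</speech>",
  "<speech id='s4'>text4</speech>",
  "</publicwhip>"
))

headings    <- xml_find_all(toy, "//major-heading")
heading_ids <- xml_attr(headings, "id")

# One XPath per heading: speeches with exactly n major-heading siblings before them
paths <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]",
                 seq_along(headings))

# Build one small data.frame per heading, then stack them
rows <- lapply(seq_along(paths), function(i) {
  speeches <- xml_find_all(toy, paths[i])
  data.frame(
    major_heading_id = rep(heading_ids[i], length(speeches)),
    speech_text      = trimws(xml_text(speeches)),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
print(result)
```

Each toy speech ends up paired with the nearest major heading that precedes it; the minor heading in between does not affect the count. Note that this runs one XPath query per heading, so on a very large document a single pass over the sibling nodes may be faster.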