I am trying to parse a number of documents using the excellent xml2
R library. As an example, consider the following XML file:
library(xml2)
pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")
This contains a number of <speech> tags which are separated by, though not nested within, a number of <minor-heading> and <major-heading> tags. I would like to process this document into a data.frame with the following structure:
major_heading_id speech_text
heading_id_1 text1
heading_id_1 text2
heading_id_2 text3
heading_id_2 text4
Unfortunately, because the tags are not nested, I cannot figure out how to do this! I have code that successfully recovers the relevant information (see below), but matching the speech tags to their respective major-headings is beyond me.
My intuition is that it would probably be best to split the XML document at the heading tags, and then process each as an individual document, but I couldn't find a function in the xml2
package that would let me do this!
Any help would be great.
Where I have got to so far:
speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))
heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")
You can do this as follows:
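A minimal sketch of one way to do it, assuming the headings and speeches are siblings under the same parent element. For each speech, the XPath `preceding-sibling::major-heading[1]` selects the nearest major heading that comes before it. The toy document, its ids and its text are made up for illustration; with the real file you would use `pg` in place of `doc`.

```r
library(xml2)

# Toy document standing in for the real file (hypothetical ids and text).
doc <- read_xml('<publicwhip>
  <major-heading id="heading_id_1">A</major-heading>
  <speech id="s1">text1</speech>
  <minor-heading id="m1">a</minor-heading>
  <speech id="s2">text2</speech>
  <major-heading id="heading_id_2">B</major-heading>
  <speech id="s3">text3</speech>
  <speech id="s4">text4</speech>
</publicwhip>')

speech_recs <- xml_find_all(doc, "//speech")

# For each speech, find the closest preceding major-heading sibling.
# xml_find_first() on a nodeset returns one result per input node,
# so the two vectors below stay aligned.
heading_recs <- xml_find_first(speech_recs, "preceding-sibling::major-heading[1]")

result <- data.frame(
  major_heading_id = xml_attr(heading_recs, "id"),
  speech_text = trimws(xml_text(speech_recs)),
  stringsAsFactors = FALSE
)
```

A speech that appears before any major heading should get NA as its major_heading_id. If the headings are not siblings of the speeches in your documents, `preceding::major-heading[1]` may be a better fit than `preceding-sibling::`.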
This results in: