Parsing non-nested XML tags in R

Question

123 views Asked by user2728808 At 16 December 2016 at 12:12

I am trying to parse a number of documents using the excellent xml2 R library. As an example, consider the following XML file:

library(xml2)
pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

This file contains a number of <speech> tags that are separated by, though not nested within, a number of <minor-heading> and <major-heading> tags. I would like to process this document into a data.frame with the following structure:

     major_heading_id  speech_text
     heading_id_1       text1
     heading_id_1       text2
     heading_id_2       text3
     heading_id_2       text4

Unfortunately, because the tags are not nested, I cannot figure out how to do this! I have code that successfully recovers the relevant information (see below), but matching the speech tags to their respective major-headings is beyond me.

My intuition is that it would probably be best to split the XML document at the heading tags, and then process each as an individual document, but I couldn't find a function in the xml2 package that would let me do this!

Any help would be great.

Where I have got to so far:

speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))

heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")

There is 1 answer

**Rentrop** · Accepted Answer · 2016-12-16T12:40:03+00:00

You can do this as follows:

library(xml2)
library(tidyverse)
doc <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

# Get the headings
heading_recs <- xml_find_all(doc, "//major-heading")

# Build one XPath per heading: the n-th path selects the speech nodes
# that have exactly n major-heading siblings before them.
path <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]", 
                seq_along(heading_recs))

# Get the text of the speech nodes
map(path, ~xml_text(xml_find_all(doc, .x))) %>% 
# Combine it with the id of the headings
  map2_df(xml_attr(heading_recs, "id"), 
          ~tibble(major_heading_id = .y, speech_text = .x))

This results in a tibble with one row per speech and two columns, major_heading_id and speech_text.
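
For comparison, here is a minimal sketch of a different way to get the same pairing: instead of building one XPath per heading, look up the nearest preceding major-heading for each speech. This is not the approach used in the code above, just an alternative that may be easier to check by eye. The inline document and its ids (h1, s1, and so on) are made up for illustration and only assume that headings and speeches are siblings under one root, as in the file from the question.

```r
library(xml2)

# Hypothetical miniature document: headings and speeches are siblings, not nested.
toy <- read_xml('<publicwhip>
  <major-heading id="h1">First debate</major-heading>
  <speech id="s1">text1</speech>
  <minor-heading id="m1">Aside</minor-heading>
  <speech id="s2">text2</speech>
  <major-heading id="h2">Second debate</major-heading>
  <speech id="s3">text3</speech>
  <speech id="s4">text4</speech>
</publicwhip>')

speeches <- xml_find_all(toy, "//speech")

# For each speech, take the closest major-heading sibling that precedes it.
# preceding-sibling is a reverse axis, so position [1] is the nearest node.
# xml_find_first() returns one result per input node.
nearest_heading <- xml_find_first(speeches, "preceding-sibling::major-heading[1]")

result <- data.frame(
  major_heading_id = xml_attr(nearest_heading, "id"),
  speech_text      = trimws(xml_text(speeches)),
  stringsAsFactors = FALSE
)
print(result)
```

A speech that appears before any major-heading has no match, so its major_heading_id should come back as NA rather than being dropped, which keeps the two columns the same length.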
