Parse nested XML in R with repeated tags (OSM history)

27 views Asked by At

I'm trying to read the history of OpenStreetMap elements, which is a nested XML object with repeated tags in R, and I have not been able to progress up from a certain point. This is part of an effort to detect vandalism in certain kinds of elements.

A sample object that I could read is this one, which looks like this.

<osm version="0.6" generator="CGImap 0.8.8 (1674799 spike-06.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
<relation id="368462" visible="true" version="1" changeset="3430908" timestamp="2009-12-22T22:50:41Z" user="Blanes Lluis" uid="172008">
<member type="way" ref="45378201" role="outer"/>
<member type="way" ref="45338918" role="outer"/>
<member type="way" ref="45362285" role="outer"/>
<member type="way" ref="45362286" role="outer"/>
<member type="way" ref="45323394" role="outer"/>
<member type="way" ref="44988246" role="outer"/>
<member type="way" ref="45341594" role="outer"/>
<member type="way" ref="45369117" role="outer"/>
<member type="way" ref="45324958" role=""/>
<member type="way" ref="45324960" role=""/>
<tag k="admin_level" v="7"/>
<tag k="boundary" v="administrative"/>
<tag k="name" v="La Selva"/>
<tag k="type" v="boundary"/>
<tag k="wikipedia" v="es:La Selva"/>
</relation>
<relation id="368462" visible="true" version="2" changeset="3430948" timestamp="2009-12-22T22:59:53Z" user="Blanes Lluis" uid="172008">
<member type="way" ref="45378201" role="outer"/>
<member type="way" ref="45338918" role="outer"/>
<member type="way" ref="45362285" role="outer"/>
<member type="way" ref="45362286" role="outer"/>
<member type="way" ref="45323394" role="outer"/>
<member type="way" ref="44988246" role="outer"/>
<member type="way" ref="45341594" role="outer"/>
<member type="way" ref="45369117" role="outer"/>
<member type="way" ref="45324958" role="outer"/>
<member type="way" ref="45324960" role="outer"/>
<member type="way" ref="45324953" role="outer"/>
<member type="way" ref="45341583" role="outer"/>
<member type="way" ref="45323298" role="outer"/>
<member type="way" ref="45355688" role="outer"/>
<tag k="admin_level" v="7"/>
<tag k="boundary" v="administrative"/>
<tag k="name" v="La Selva"/>
<tag k="type" v="boundary"/>
<tag k="wikipedia" v="es:La Selva"/>
</relation>

I'm trying to get a data.frame that looks like this or, at the very least, two lists of data.frames containing each version of the object, one for all members and one for all tags. It should look like this.

> members
relation_id version changeset user member_type member_ref member_role
1      112233       1     75757  Joe         way      12345       outer
2      112233       1     75757  Joe         way       6789       outer
3      112233       1     75757  Joe        node     123456            
4      112233       2     86868 Jane         way      12345       outer
5      112233       2     86868 Jane         way       6789       outer
6      112233       2     86868 Jane        node     123456            

> tags
  relation_id version changeset user         key                   value
1      112233       1     75757  Joe        name                La Selva
2      112233       1     75757  Joe    boundary          administrative
3      112233       1     75757  Joe admin_level                       7
4      112233       2     86868 Jane        name         Vandalised name
5      112233       2     86868 Jane    boundary look at me I'm a vandal
6      112233       2     86868 Jane admin_level        Vandalism rules!

The problem is that when I try to convert the xml to a list, all tags and members information get into the attributes, as such.

$osm$relation$tag
list()
attr(,"k")
[1] "population"
attr(,"v")
[1] "172280"

$osm$relation$tag
list()
attr(,"k")
[1] "population:date"
attr(,"v")
[1] "2011"

If I try to convert it into a data.frame or a tibble using methods such as this one or this one, I end up with empty data.frames since the lists themselves are empty.

The most promising approach is this one

library(httr)
library(xml2)
selva <- GET("https://www.openstreetmap.org/api/0.6/relation/368462/history")
selva2 <- read_xml(selva)
rels <- xml2::xml_find_all(selva2, "//relation")
reltags <- lapply(rels, \(x) xml2::xml_find_all(x, "//relation"))

Which returns something like this.

> head(rels)
{xml_nodeset (6)}
[1] <relation id="368462" visible="true" version="1" changeset="3430908" timestamp="2009-12-22T22:50:41Z" user="Blanes Lluis ...
[2] <relation id="368462" visible="true" version="2" changeset="3430948" timestamp="2009-12-22T22:59:53Z" user="Blanes Lluis ...
[3] <relation id="368462" visible="true" version="3" changeset="3430957" timestamp="2009-12-22T23:01:26Z" user="Blanes Lluis ...
[4] <relation id="368462" visible="true" version="4" changeset="3430967" timestamp="2009-12-22T23:04:15Z" user="Blanes Lluis ...
[5] <relation id="368462" visible="true" version="5" changeset="3431008" timestamp="2009-12-22T23:14:07Z" user="Blanes Lluis ...
[6] <relation id="368462" visible="true" version="6" changeset="3431022" timestamp="2009-12-22T23:19:20Z" user="Blanes Lluis
>head(reltags)
[[1]]
{xml_nodeset (1263)}
 [1] <tag k="admin_level" v="7"/>
 [2] <tag k="boundary" v="administrative"/>
 [3] <tag k="name" v="La Selva"/>
 [4] <tag k="type" v="boundary"/>
 [5] <tag k="wikipedia" v="es:La Selva"/>
 [6] <tag k="admin_level" v="7"/>
 [7] <tag k="boundary" v="administrative"/>
 [8] <tag k="name" v="La Selva"/>
 [9] <tag k="type" v="boundary"/>
[10] <tag k="wikipedia" v="es:La Selva"/>

However, trying to get it into a data.frame returns the following

reltags[[1]]  |> as_list() |> as_tibble(.name_repair = "unique")

A tibble: 0 × 1,263
# ℹ 1,263 variables: ...1 <list>, ...2 <list>, ...3 <list>, ...4 <list>, ...5 <list>, ...6 <list>, ...7 <list>, ...8 <list>,
#   ...9 <list>, ...10 <list>, ...11 <list>, ...12 <list>, ...13 <list>, ...14 <list>, ...15 <list>, ...16 <list>,
#   ...17 <list>, ...18 <list>, ...19 <list>, ...20 <list>, ...21 <list>, ...22 <list>, ...23 <list>, ...24 <list>,
#   ...25 <list>, ...26 <list>, ...27 <list>, ...28 <list>, ...29 <list>, ...30 <list>, ...31 <list>, ...32 <list>,
#   ...33 <list>, ...34 <list>, ...35 <list>, ...36 <list>, ...37 <list>, ...38 <list>, ...39 <list>, ...40 <list>,
#   ...41 <list>, ...42 <list>, ...43 <list>, ...44 <list>, ...45 <list>, ...46 <list>, ...47 <list>, ...48 <list>,
#   ...49 <list>, ...50 <list>, ...51 <list>, ...52 <list>, ...53 <list>, ...54 <list>, ...55 <list>, ...56 <list>,

Which is, of course, not what I'm looking for. At this point, I haven't got an idea about where to go next.

EDIT, this solution gets a lot closer to the result, but it keeps losing the member and tag information.

library(httr2)
library(tidyverse)
selva <- httr2::request("https://www.openstreetmap.org/api/0.6/relation/368462/history") |> 
  req_perform()

selvaxml <- resp_body_xml(selva)

resp_l <- selvaxml |> 
  xml2::as_list() |> 
  unlist(recursive = FALSE) |> 
  lapply(X = _, attributes) |>
  tibble::as_tibble(.name_repair = "unique") |> 
  janitor::clean_names()


resp_unnested <- resp_l |> 
  t() |> 
  as.data.frame() |> 
  setNames(c("elements", "osm_id", "visible", "version", "changeset", "timestamp", "user", "user_id")) |> 
  unnest(everything())
# A tibble: 10,241 × 8
   elements osm_id visible version changeset timestamp            user         user_id
   <chr>    <chr>  <chr>   <chr>   <chr>     <chr>                <chr>        <chr>  
 1 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 2 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 3 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 4 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 5 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 6 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 7 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 8 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
 9 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
10 member   368462 true    1       3430908   2009-12-22T22:50:41Z Blanes Lluis 172008 
# ℹ 10,231 more rows
0

There are 0 answers