How to find the path of an element in a nested list

How can I find the path of an element in a nested list without manually digging through a list in a View?

Here is an example that I can already deal with:

l1 <- list(x = list(a = "no_match", b = "test_noname", c = "test_noname"),
           y = list(a = "test_name"))

After looking for an off-the-shelf solution in other packages, I found this approach (strongly inspired by rlist::list.search):

list_search <- function(l, f) {
  ulist <- unlist(l, recursive = TRUE, use.names = TRUE)
  match <- f(ulist)
  ulist[match]
}
list_search(l1, f = \(x) x == "test_noname")
          x.b           x.c 
"test_noname" "test_noname" 

This works pretty well as it’s easy to understand that the name “x.b” here can be translated for access like this:

l1[["x"]][["b"]]
[1] "test_noname"
# Or
purrr::pluck(l1, "x", "b")
[1] "test_noname"

And I can get all elements on the same level, by leaving out the last level index:

l1[["x"]]
$a
[1] "no_match"

$b
[1] "test_noname"

$c
[1] "test_noname"

This is usually my goal, as I know the values/name of one of the elements I want to get to and other similar elements are placed on the same sub-level (or sub-sub-sub-sub-sub-sub-sub-level).
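For fully named lists such as l1, the translation from an unlist() name to an access path can be sketched in base R. The helper name_to_path below is hypothetical (it is not part of rlist or purrr), and it assumes that no element name itself contains a dot:

```r
l1 <- list(x = list(a = "no_match", b = "test_noname", c = "test_noname"),
           y = list(a = "test_name"))

# Hypothetical helper: split a name produced by unlist() into its levels.
# Assumption: no element name contains a literal "."
name_to_path <- function(nm) strsplit(nm, ".", fixed = TRUE)[[1]]

path <- name_to_path("x.b")

# `[[` with a character vector indexes recursively: l1[["x"]][["b"]]
hit <- l1[[path]]

# Dropping the last level gives everything on the same level
same_level <- l1[[head(path, -1)]]
```

This breaks down as soon as names contain dots or elements are unnamed, which is the situation described next.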

However, many JSON files on the internet are not quite meant for easy consumption and parse into much more complicated lists that look more like this:

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")), y = list(a = "test_name"))
str(l2)
List of 2
 $ x:List of 2
  ..$ : chr "no_match"
  ..$ :List of 2
  .. ..$ : chr "test_noname1"
  .. ..$ : chr "test_noname2"
 $ y:List of 1
  ..$ a: chr "test_name"
list_search(l2, f = \(x) x == "test_noname1")
            x2 
"test_noname1" 

From the resulting names, I would guess that the element “x2” can be accessed like this:

l2[["x2"]]
NULL
# or maybe
l2[["x"]][[2]]
[[1]]
[1] "test_noname1"

[[2]]
[1] "test_noname2"

But to not also rake in “test_noname2” here, I actually need something like this:

l2[["x"]][[2]][[1]]
[1] "test_noname1"

Background

I often need to find the path of a known value when getting data through webscraping. I might have a user name or URL that I know is somewhere in the data, but it's tedious to actually find it. Once one value is identified, it becomes easy to generalise to its siblings, which are unknown so far. In the toy example, this would look like this:

l2[["x"]][[2]]
[[1]]
[1] "test_noname1"

[[2]]
[1] "test_noname2"

Only in reality, the lists I'm working with are nested much deeper.

So the issue is essentially the unnamed elements in the list: unlist (and rapply, for that matter) does not assign them names that translate easily back into an access path. Ideally there would be an automated way to translate these into a pluck call.
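To make the goal concrete, here is a minimal base R sketch of what such an automated translation could look like. The helpers find_paths and get_at are hypothetical names introduced for illustration, not functions from any package. The sketch walks the list recursively and records a name where one exists and a position otherwise:

```r
# Hypothetical helper: return the access path of every leaf for which f() is TRUE.
# Named elements are recorded by name, unnamed ones by position.
find_paths <- function(l, f, path = list()) {
  if (!is.list(l)) {
    # Leaf: keep the accumulated path only if the predicate matches
    if (isTRUE(f(l))) return(list(path))
    return(list())
  }
  nms <- names(l)
  out <- list()
  for (i in seq_along(l)) {
    has_name <- !is.null(nms) && !is.na(nms[i]) && nzchar(nms[i])
    key <- if (has_name) nms[i] else i
    out <- c(out, find_paths(l[[i]], f, c(path, list(key))))
  }
  out
}

# Hypothetical helper: follow a path with repeated `[[`
get_at <- function(l, path) Reduce(function(acc, key) acc[[key]], path, l)

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

paths <- find_paths(l2, \(x) identical(x, "test_noname1"))
str(paths)

hit      <- get_at(l2, paths[[1]])            # the matching element
siblings <- get_at(l2, head(paths[[1]], -1))  # its parent, i.e. the siblings
```

Each path is a plain list of accessors, so it should also be possible to splice it into purrr::pluck() with !!!, although that is not exercised here.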

There are 3 answers

G. Grothendieck On BEST ANSWER

If the question is how to get the path given the contents of a cell, then rrapply from the package of the same name can be used:

library(rrapply)

ix <- rrapply(l2, 
  condition = \(x) x == "test_noname1",
  f = \(x, .xpos) .xpos,
  how = "flatten")

unlist(ix)
## 11 12 13 
##  1  2  1 

l2[[unlist(ix)]]
## [1] "test_noname1"

library(purrr)
pluck(l2, !!!unlist(ix))
## [1] "test_noname1"
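The position vector is purely numeric. If a path that uses names where they exist is preferred (closer to the l2[["x"]][[2]][[1]] form in the question), a small base R sketch could post-process it. label_path is a hypothetical helper, and the positions are hard-coded here from the unlist(ix) output above so that the sketch runs without rrapply:

```r
l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

# Positions as printed by unlist(ix), hard-coded for this sketch
idx <- c(1L, 2L, 1L)

# Hypothetical helper: replace each position by the element name where one exists
label_path <- function(l, idx) {
  out <- vector("list", length(idx))
  for (k in seq_along(idx)) {
    nm <- names(l)[idx[k]]
    out[[k]] <- if (!is.null(nm) && !is.na(nm) && nzchar(nm)) nm else idx[k]
    l <- l[[idx[k]]]  # descend one level
  }
  out
}

path <- label_path(l2, idx)
str(path)

# Follow the mixed name/position path with repeated `[[`
value <- Reduce(function(acc, key) acc[[key]], path, l2)
```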

Note

Input from question

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))
Martin Morgan On

Update

@JBGruber points out in a comment that I didn't really talk about discovering key or value paths, which was the point of the original question.

I updated the GitHub version of rjsoncons to include new functionality j_find_values(), j_find_values_grep(), j_find_keys(), j_find_keys_grep() to directly enable this.

Until available on CRAN, install with

remotes::install_github("mtmorgan/rjsoncons")

Use as

json = '{"x":["no_match",["test_noname1","test_noname2"]],"y":{"a":"test_name"}}'

j_find_values(json, "test_noname2", as = "data.frame")
##     path        value
## 1 /x/1/1 test_noname2

The key is a JSONpointer path, which j_query() supports (in addition to JMESpath and JSONpath). So the sibs are...

j_query(json, '/x/1', as = "R")
## [1] "test_noname1" "test_noname2"

The functions work directly with R objects, but the path returned is not so useful...

j_find_values(l2, "test_noname2", auto_unbox = TRUE, as = "data.frame")
##     path        value
## 1 /x/1/1 test_noname2
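The path is 0-based JSONpointer rather than an R index, which is why it cannot be used on l2 directly. A minimal base R sketch of a conversion follows; pointer_to_path is a hypothetical helper, not part of rjsoncons, and it assumes that every all-digit token is an array position rather than an object key:

```r
# Hypothetical helper: turn a JSONpointer such as "/x/1/1" into a list of
# R accessors, shifting 0-based array positions to 1-based ones.
# Assumption: all-digit tokens are array positions, not object keys.
pointer_to_path <- function(pointer) {
  tokens <- strsplit(sub("^/", "", pointer), "/", fixed = TRUE)[[1]]
  # Undo JSONpointer escaping (RFC 6901): first "~1" becomes "/", then "~0" becomes "~"
  tokens <- gsub("~0", "~", gsub("~1", "/", tokens, fixed = TRUE), fixed = TRUE)
  lapply(tokens, function(tok) {
    if (grepl("^[0-9]+$", tok)) as.integer(tok) + 1L else tok
  })
}

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

path  <- pointer_to_path("/x/1/1")
value <- Reduce(function(acc, key) acc[[key]], path, l2)
```

The digit assumption fails for a named list whose keys are themselves numbers, so the result should be checked against the structure of the data.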

For more details see the help page and vignette section.

Original answer

I'll mention the CRAN package rjsoncons and JMESpath / JSONpath / JSONpointer as a way to query JSON documents directly to R objects.

I converted your R object back to json

> json = jsonlite::toJSON(l2, auto_unbox = TRUE)
> json
{"x":["no_match",["test_noname1","test_noname2"]],"y":{"a":"test_name"}}

And then explored it interactively using rjsoncons::j_query() and JMESpath

> j_query(json, "x[0]")  # JSON arrays are 0-based
[1] "no_match"
> j_query(json, "x[1]", as = "R")
[1] "test_noname1" "test_noname2"
> j_query(json, "x[1][0]", as = "R")
[1] "test_noname1"

Nested objects are queried using ., so

> j_query(json, "y.a")
[1] "test_name"

In practice I explore novel JSON documents using listviewer::jsonedit(json)

Performance and other considerations

In response to the comment by @CarlWhithoft, for problems of this size performance is obviously a very secondary consideration -- any successful approach will complete in a fraction of a second.

rjsoncons works best on the original JSON string, file or connection, rather than an R object coerced to JSON. In these cases the data is processed mostly in C, and only the result of interest returned to R. For small or medium sized JSON objects performance does not really matter, it is the flexibility of JMESpath / JSONpath, and JSONpointer that might make rjsoncons appealing, although the trade-off is learning an arcane query syntax versus an arcane set of R list manipulation commands.

One place where rjsoncons is particularly useful is when processing 'newline-delimited' JSON, where each line is a complete JSON object. Often these records have identical or very similar structure. There are several StackOverflow questions that are NDJSON-like, e.g., with each row in a data.frame a JSON object; see the rjsoncons Examples vignette or, e.g., Efficient conversion of json data in R on StackOverflow.

An rjsoncons NDJSON vignette provides some real-world examples and performance comparison with competitors (DuckDB's JSON parser turns out to be really fast and scalable for the SQL pros...); there isn't a viable R-only competitor at the moment. The ndjson CRAN package turns out to be quite slow for reasons that are not particularly clear; yyjsonr seems like it will have NDJSON parsing at some point (it was recently re-introduced on GitHub), which is likely to be 5-10x faster than rjsoncons but without the flexibility of queries; RcppSimdJson is also fast and supports JSONpointer, but JSONpointer is sometimes not flexible enough. See further discussion in the rjsoncons NDJSON vignette.

As an aside, an R object can be translated to JSON and then queried with something like j_query(l2, "x[1]", auto_unbox = TRUE, as = "R"); this calls jsonlite::toJSON() 'under the hood'.

Stéphane Laurent On

Here is a way with the jsonStrings package:

library(jsonStrings)
library(jsonlite)

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))
## make a jsonString
jstring <- jsonString$new(toJSON(l2, auto_unbox = TRUE))
## get all paths
paths <- jstring$flatten()$keys()
# "/x/0"   "/x/1/0" "/x/1/1" "/y/a"  
## test each path
vapply(paths, function(path) {
  jspatch <- list(list(op = "test", path = path, value = "test_noname2")) |> 
    toJSON(auto_unbox = TRUE)
  !inherits(try(jstring$patch(jspatch), silent = TRUE), "try-error")
}, logical(1L)) |> which() |> names()
# "/x/1/1"