R: Issues when pulling values from a nested list into a dataframe

193 views Asked by At

so this should be a relatively easy question on pulling items in a list into a dataframe, but I'm stuck on something.

I have the following list (I'm showing just part of the list for you, it's far longer than this):

str(raw_jobs_list)

List of 2
 $ :List of 4
  ..$ id    : chr "3594134"
  ..$ score : int 1
  ..$ fields:List of 16
  .. ..$ date             :List of 3
  .. .. ..$ changed: chr "2020-04-18T00:35:00+00:00"
  .. .. ..$ created: chr "2020-04-07T11:15:37+00:00"
  .. .. ..$ closing: chr "2020-04-17T00:00:00+00:00"
  .. ..$ country          :List of 1
  .. .. ..$ :List of 6
  .. .. .. ..$ href     : chr "https://api.reliefweb.int/v1/countries/149"
  .. .. .. ..$ name     : chr "Mali"
  .. .. .. ..$ location :List of 2
  .. .. .. .. ..$ lon: num -1.25
  .. .. .. .. ..$ lat: num 17.4
  .. .. .. ..$ id       : int 149
  .. .. .. ..$ shortname: chr "Mali"
  .. .. .. ..$ iso3     : chr "mli"
  .. ..$ title            : chr "REGIONAL MANAGER West Africa"

I tried pulling them out using:

jobs_data_df <- list.stack(list.select(raw_jobs_list, 
                                       fields$title, 
                                       fields$country$name,
                                       fields$date$created))

Where raw_jobs_list is the list, but I get these NAs and am not sure how to get past it.

glimpse(jobs_data_df)
Rows: 2
Columns: 3
$ V1 <chr> "REGIONAL MANAGER West Africa", "Support Relief Group Public Health Advisor (Multiple Positions)"
$ V2 <lgl> NA, NA
$ V3 <chr> "2020-04-07T11:15:37+00:00", "2020-05-04T15:20:37+00:00"

It's possible there's something obvious I'm overlooking as I haven't worked much with lists before. Any ideas?

Thanks so much! C

PS. If you're interested, I'm working with this API and this is how I got there so far.

jobs <- GET(url = "https://api.reliefweb.int/v1/jobs?appname=apidoc&preset=analysis&profile=full&limit=2")


raw_jobs_list <- content(jobs)$data

The portion displayed above is a subset of the whole data; here is a portion of the first element of the list:

dput(lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))]))))

list(list(id = "3594134", score = 1L, fields = list(date = list(
    changed = "2020-04-18T00:35:00+00:00", created = "2020-04-07T11:15:37+00:00", 
    closing = "2020-04-17T00:00:00+00:00"), country = list(list(
    href = "https://api.reliefweb.int/v1/countries/149", name = "Mali", 
    location = list(lon = -1.25, lat = 17.35), id = 149L, shortname = "Mali", 
    iso3 = "mli")), title = "REGIONAL MANAGER West Africa")), 
    list(id = "3594129", score = 1L, fields = list(date = list(
        changed = "2020-05-19T00:04:01+00:00", created = "2020-05-04T15:20:37+00:00", 
        closing = "2020-05-18T00:00:00+00:00"), title = "Support Relief Group Public Health Advisor (Multiple Positions)")))
1

There are 1 answers

2
r2evans On BEST ANSWER

If you look at just one element at a time, I think that as.data.frame does a pretty decent job. While I'll demonstrate using the abbreviated data (that I edited into your question), and the first element looks like:

raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))

as.data.frame(raw_jobs_sublist[[1]])
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                 fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli REGIONAL MANAGER West Africa

Shown differently (just for variety here), it's

str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of  13 variables:
#  $ id                         : chr "3594134"
#  $ score                      : int 1
#  $ fields.date.changed        : chr "2020-04-18T00:35:00+00:00"
#  $ fields.date.created        : chr "2020-04-07T11:15:37+00:00"
#  $ fields.date.closing        : chr "2020-04-17T00:00:00+00:00"
#  $ fields.country.href        : chr "https://api.reliefweb.int/v1/countries/149"
#  $ fields.country.name        : chr "Mali"
#  $ fields.country.location.lon: num -1.25
#  $ fields.country.location.lat: num 17.4
#  $ fields.country.id          : int 149
#  $ fields.country.shortname   : chr "Mali"
#  $ fields.country.iso3        : chr "mli"
#  $ fields.title               : chr "REGIONAL MANAGER West Africa"

In order to do this on all elements, we need to account for a few things:

  • not all elements have all fields, so whatever method we use needs to "fill" in the blanks;
  • we don't want to do it iteratively, let's just combine them all at once.

Here's a first stab:

dplyr::bind_rows(lapply(raw_jobs_sublist, as.data.frame))
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                                                    fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli                                    REGIONAL MANAGER West Africa
# 2 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                       <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA> Support Relief Group Public Health Advisor (Multiple Positions)

This also works with data.table::rbindlist. It does not work as well with do.call(rbind.data.frame, ...), since that is less tolerant of missing names. (This it can be done without too much trouble, there are occasionally other advantages to using these two options.)

Note: if you do this on the original data, R's default mechanism of displaying a data.frame will cramp your console with all of the text, which might be annoying. If you are already using dplyr or data.table in any of your work, both of those formats provide string-limiting, so that it is more tolerable on the console. For example, showing the whole thing:

tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list, as.data.frame)))
# # A tibble: 2 x 42
#   id    score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
#   <chr> <int> <chr>            <chr>            <chr>            <chr>            <chr>                       <dbl>            <dbl>            <int> <chr>            <chr>            <chr>                       <int> <chr>       <chr>            <chr>                       <int> <chr>                       <int> <chr>            <chr>            <chr>        <chr>      
# 1 3594~     1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali                        -1.25             17.4              149 Mali             mli              Donor Relations~            20966 Bamako      https://api.rel~ ICCO COOPERATION            45059 Non-governmenta~              274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~     1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA>             <NA>                        NA                NA                 NA <NA>             <NA>             Program/Project~             6867 <NA>        https://api.rel~ US Agency for I~             1751 Government                    271 USAID            http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>, fields.type.id <int>, fields.experience.name <chr>, fields.experience.id <int>, fields.url <chr>, fields.url_alias <chr>, fields.how_to_apply <chr>, fields.id <int>, fields.status <chr>, fields.body.html <chr>, fields.how_to_apply.html <chr>, href <chr>, fields.source.longname <chr>, fields.source.spanish_name <chr>,
# #   fields.theme.name <chr>, fields.theme.id <int>, fields.theme.name.1 <chr>, fields.theme.id.1 <int>

data.table::rbindlist(lapply(raw_jobs_list, as.data.frame), fill = TRUE)
#         id score       fields.date.changed       fields.date.created       fields.date.closing                     fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3     fields.career_categories.name fields.career_categories.id fields.name
#     <char> <int>                    <char>                    <char>                    <char>                                  <char>              <char>                       <num>                       <num>             <int>                   <char>              <char>                            <char>                       <int>      <char>
# 1: 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri...                Mali                       -1.25                       17.35               149                     Mali                 mli Donor Relations/Grants Management                       20966      Bamako
# 2: 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                    <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA>        Program/Project Management                        6867        <NA>
# 27 variables not shown: [fields.source.href <char>, fields.source.name <char>, fields.source.id <int>, fields.source.type.name <char>, fields.source.type.id <int>, fields.source.shortname <char>, fields.source.homepage <char>, fields.title <char>, fields.body <char>, fields.type.name <char>, ...]

For data.table, I already have some options set that facilitate this. Notably, I'm currently using:

options(
  datatable.prettyprint.char = 36,
  datatable.print.topn = 10,
  datatable.print.class = TRUE,
  datatable.print.trunc.cols = TRUE
)

At this point, you have a data.frame that should contain all of the data (and NA for elements with missing fields). From here, if you don't like the nested-names convention (e.g., fields.date.changed), then they can be easily renamed using patterns or conventional methods.