Unnesting an R dataframe from a JSON string in .txt file

Question

Unnesting an R dataframe from a JSON string in .txt file

28 views Asked by Davidmac At 22 September 2023 at 08:14

I am new to using data in JSON strings.

I have quite a similar issue to @Christopher here (Parsing JSON arrays from a .txt file in R - several large files) with the US Energy Information Administration bulk dataset.

I was able to parse the json string in the .txt file to R df using stream_in instruction in the link above. But I don't know how to proceed to unnest all the data into a long format for each variable under col(name).

Here is my onedrive link to the data https://1drv.ms/t/s!Au35hXWxiOIzh-0l_Th6cSIzLObSzg?e=FN2OeP

Any help would be appreciated.

Thanks

#library(jsonlite)
#library(tidyverse)

#filepath <- "https://1drv.ms/t/s!Au35hXWxiOIzh-0l_Th6cSIzLObSzg?e=FN2OeP"
#content <- file(filepath, open = "rt")
#seds <- stream_in(content)
#glimpse(seds)

#seds.unnest <- seds %>% unnest(cols = c(series_id, name, units, f, description, copyright, source, #iso3166, geography, start, end, last_updated, geoset_id, data))
#seds.unnest <- t(seds.unnest) %>% data.frame()

Original Q&A

There are 1 answers

**r2evans** · Answer 1 · 2023-09-22T13:15:35+00:00

You have four datasets in there, not one:

system("sed -E 's/:.*//g' ~/Downloads/SEDS.txt | sort | uniq -c")
#     416 
#     972 {"category_id"
#     623 {"geoset_id"
#       1 {"relation_id"
#   37533 {"series_id"

I think we read them in individually. We could use readLines and grab 48MB+ at once, but R's global string pool might punish you a little ... I suggest this is a little faster and cleaner anyway. (These data still are rather large in R's memory due to all of the strings ...)

category <- jsonlite::stream_in(pipe("grep '^{\"category_id\"' ~/Downloads/SEDS.txt"))
# opening pipe input connection.
#  Imported 972 records. Simplifying...
# closing pipe input connection.
str(category)
# 'data.frame': 972 obs. of  5 variables:
#  $ category_id       : chr  "40204" "40208" "40961" "40605" ...
#  $ parent_category_id: chr  "40203" "40204" "40208" "40961" ...
#  $ name              : chr  "Consumption" "Total End-Use Sectors" "All Petroleum Products" "All Petroleum Products" ...
#  $ notes             : chr  "" "" "" "" ...
#  $ childseries       :List of 972
#   ..$ : chr 
#   ..$ : chr  "SEDS.TETXB.AK.A" "SEDS.TETXB.AL.A" "SEDS.TETXB.AR.A" "SEDS.TETXB.AZ.A" ...
#   ..$ : chr 
#   ..$ : chr  "SEDS.PATXP.AK.A" "SEDS.PATXP.AL.A" "SEDS.PATXP.AR.A" "SEDS.PATXP.AZ.A" ...
#   ..$ : chr  "SEDS.PATXB.AK.A" "SEDS.PATXB.AL.A" "SEDS.PATXB.AR.A" "SEDS.PATXB.AZ.A" ...
#   ..$ : chr 
#   .. [list output truncated]

The childseries column is a list-column, which means normal string-ops will not work on it as-is. See the work on series below for some thoughts on how to work with this type of data, though I don't think converting to a frame and then unnesting will work with this childseries.

geoset is much simpler:

geoset <- jsonlite::stream_in(pipe("grep '^{\"geoset_id\"' ~/Downloads/SEDS.txt"))
# opening pipe input connection.
#  Imported 623 records. Simplifying...
# closing pipe input connection.
str(geoset)
# 'data.frame': 623 obs. of  3 variables:
#  $ geoset_id: chr  "SEDS.ABICB.A" "SEDS.ABICP.A" "SEDS.ARICB.A" "SEDS.ARICD.A" ...
#  $ name     : chr  "Aviation gasoline blending components consumed by the industrial sector." "Aviation gasoline blending components consumed by the industrial sector." "Asphalt and road oil consumed by the industrial sector." "Asphalt and road oil price in the industrial sector." ...
#  $ units    : chr  "Billion Btu" "Thousand barrels" "Billion Btu" "Dollars per million Btu" ...
head(geoset)
#      geoset_id                                                                     name                   units
# 1 SEDS.ABICB.A Aviation gasoline blending components consumed by the industrial sector.             Billion Btu
# 2 SEDS.ABICP.A Aviation gasoline blending components consumed by the industrial sector.        Thousand barrels
# 3 SEDS.ARICB.A                  Asphalt and road oil consumed by the industrial sector.             Billion Btu
# 4 SEDS.ARICD.A                     Asphalt and road oil price in the industrial sector. Dollars per million Btu
# 5 SEDS.ARICP.A                  Asphalt and road oil consumed by the industrial sector.        Thousand barrels
# 6 SEDS.ARICV.A              Asphalt and road oil expenditures in the industrial sector.         Million dollars

I'll leave relation_id as an exercise.

The last, series_id, is similar, though I'll demonstrate an optional post-read step on it; if you like keeping it as a list-column, then feel free to stop after stream_in. I'm using dplyr::mutate and tidyr::unnest here, though if you're in a package-pinch this could be done in base R.

library(dplyr)
series <- jsonlite::stream_in(pipe("grep '^{\"series_id\"' ~/Downloads/SEDS.txt")) |>
  mutate(data = lapply(data, function(z) setNames(as.data.frame(z), c("year", "value")))) |>
  tidyr::unnest(data)
nrow(series)
# [1] 1967679
series
# # A tibble: 1,967,679 × 15
#    series_id       name                                                                   units                     f     description                                                                    copyright source iso3166 geography start end   last_updated geoset_id year  value
#    <chr>           <chr>                                                                  <chr>                     <chr> <chr>                                                                          <chr>     <chr>  <chr>   <chr>     <chr> <chr> <chr>        <chr>     <chr> <chr>
#  1 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2021  0    
#  2 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2020  0    
#  3 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2019  0    
#  4 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2018  0    
#  5 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2017  0    
#  6 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2016  0    
#  7 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2015  0    
#  8 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2014  0    
#  9 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2013  0    
# 10 SEDS.CLPRK.IA.A Factor for converting coal production from physical units to Btu, Iowa Million Btu per short ton A     Factor for converting coal production from physical units to Btu, Iowa See ht… None      SEDS,… USA-IA  USA-IA    1960  2021  2023-06-24T… SEDS.CLP… 2012  0    
# # ℹ 1,967,669 more rows
# # ℹ Use `print(n = ...)` to see more rows

TechQA.

Unnesting an R dataframe from a JSON string in .txt file

There are 1 answers

Related Questions in R

Related Questions in JSON

Related Questions in DATA-SCIENCE

Related Questions in DATA-SCRUBBING

Related Questions in DATA-SCIENCE-STUDIO

Popular Questions

Trending Questions