I am new to using data in JSON strings.
I have an issue quite similar to @Christopher's here (Parsing JSON arrays from a .txt file in R - several large files), with the US Energy Information Administration bulk dataset.
I was able to parse the JSON strings in the .txt file into an R data frame using the stream_in approach from the link above, but I don't know how to proceed to unnest all the data into a long format for each variable under col(name).
Here is my OneDrive link to the data: https://1drv.ms/t/s!Au35hXWxiOIzh-0l_Th6cSIzLObSzg?e=FN2OeP
Any help would be appreciated.
Thanks
library(jsonlite)
library(tidyverse)

# stream_in() parses the newline-delimited JSON in the .txt file
filepath <- "https://1drv.ms/t/s!Au35hXWxiOIzh-0l_Th6cSIzLObSzg?e=FN2OeP"
content <- file(filepath, open = "rt")
seds <- stream_in(content)
glimpse(seds)

# attempt to unnest every column into a long format
seds.unnest <- seds %>%
  unnest(cols = c(series_id, name, units, f, description, copyright, source,
                  iso3166, geography, start, end, last_updated, geoset_id, data))
seds.unnest <- t(seds.unnest) %>% data.frame()
You have four datasets in there, not one; each line's record type can be told apart by the fields it carries: one with a childseries list-column, geoset, relation_id, and series_id.
I think we read them in individually. We could use readLines and grab 48MB+ at once, but R's global string pool might punish you a little ... I suggest this is a little faster and cleaner anyway. (These data still are rather large in R's memory due to all of the strings ...)
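A minimal sketch of the readLines route mentioned above; the local path, the read_subset helper, and the key-based routing are my assumptions, not the original code:

library(jsonlite)

path <- "SEDS.txt"   # assumed local copy of the bulk download
txt <- readLines(path, warn = FALSE)

# Each line is one standalone JSON object; pick out the lines for one record
# type by a key that only it carries, then parse that subset as NDJSON.
read_subset <- function(pattern) {
  stream_in(textConnection(txt[grepl(pattern, txt, fixed = TRUE)]),
            verbose = FALSE)
}

categories <- read_subset('"childseries"')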
The childseries column is a list-column, which means normal string-ops will not work on it as-is. See the work on series below for some thoughts on how to work with this type of data, though I don't think converting to a frame and then unnesting will work with this childseries.

geoset is much simpler:
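A sketch of that simpler case, assuming the read_subset helper above and that '"geoset_id"' is a workable routing key:

library(dplyr)

geosets <- read_subset('"geoset_id"')   # assumed routing key
glimpse(geosets)
# Its data column is again a list of [date, value] pairs, so the same
# mutate()/unnest() treatment shown for series below applies here too.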
I'll leave relation_id as an exercise.
The last, series_id, is similar, though I'll demonstrate an optional post-read step on it; if you like keeping it as a list-column, then feel free to stop after stream_in. I'm using dplyr::mutate and tidyr::unnest here, though if you're in a package-pinch this could be done in base R.
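A minimal sketch of that optional step, assuming series was read with the read_subset helper above and that its data list-column holds [date, value] pairs as two-column character matrices:

library(dplyr)
library(tidyr)

series <- read_subset('"series_id"')   # assumed routing key

# Convert each [date, value] matrix to a small data frame, then unnest the
# list-column into one long frame with one row per observation.
series_long <- series %>%
  mutate(data = lapply(data, function(m) {
    data.frame(date = m[, 1], value = as.numeric(m[, 2]),
               stringsAsFactors = FALSE)
  })) %>%
  unnest(data)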