Flattening list with variable nesting levels creates additional observations


I have a nested list of geocoded Moscow street addresses that I am converting to a dataframe. However, the dataframe I geocoded from contained only street addresses without zip codes, and in a few hundred (out of 33k) cases the geocoder returned multiple results for the same street address, each with a different zip code. This created additional nesting in the list, so the converted dataframe has a different number of observations from the initial dataframe.

A result with only one address has the following structure (ignore the gibberish; the R console does not render Cyrillic correctly):

structure(list(results = structure(list(address_components = list(
    structure(list(long_name = c("4", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ", 
    "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ", 
    "127299"), short_name = c("4", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ", 
    "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU", 
    "127299"), types = list("street_number", "route", c("political", 
    "sublocality", "sublocality_level_1"), c("locality", "political"
    ), c("administrative_area_level_2", "political"), c("country", 
    "political"), "postal_code")), .Names = c("long_name", "short_name", 
    "types"), class = "data.frame", row.names = c(NA, 7L))), 
    formatted_address = "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 4, Ìîñêâà, Ðîññèÿ, 127299", 
    geometry = structure(list(location = structure(list(lat = 55.8176896, 
        lng = 37.522891), .Names = c("lat", "lng"), class = "data.frame", row.names = 1L), 
        location_type = "ROOFTOP", viewport = structure(list(
            northeast = structure(list(lat = 55.8190385802915, 
                lng = 37.5242399802915), .Names = c("lat", "lng"
            ), class = "data.frame", row.names = 1L), southwest = structure(list(
                lat = 55.8163406197085, lng = 37.5215420197085), .Names = c("lat", 
            "lng"), class = "data.frame", row.names = 1L)), .Names = c("northeast", 
        "southwest"), class = "data.frame", row.names = 1L)), .Names = c("location", 
    "location_type", "viewport"), class = "data.frame", row.names = 1L), 
    partial_match = TRUE, place_id = "ChIJ59yLsy1ItUYR5EEBFbFJoSA", 
    types = list("street_address")), .Names = c("address_components", 
"formatted_address", "geometry", "partial_match", "place_id", 
"types"), class = "data.frame", row.names = 1L), status = "OK"), .Names = c("results", 
"status"))

Whereas a result with multiple possible addresses looks like:

structure(list(results = structure(list(address_components = list(
    structure(list(long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ", 
    "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ", 
    "127299"), short_name = c("23", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ", 
    "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU", 
    "127299"), types = list("street_number", "route", c("political", 
    "sublocality", "sublocality_level_1"), c("locality", "political"
    ), c("administrative_area_level_2", "political"), c("country", 
    "political"), "postal_code")), .Names = c("long_name", "short_name", 
    "types"), class = "data.frame", row.names = c(NA, 7L)), structure(list(
        long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", 
        "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ", "125008"), short_name = c("23", 
        "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", 
        "Ìîñêâà", "Ìîñêâà", "RU", "125008"), types = list("street_number", 
            "route", c("political", "sublocality", "sublocality_level_1"
            ), c("locality", "political"), c("administrative_area_level_2", 
            "political"), c("country", "political"), "postal_code")), .Names = c("long_name", 
    "short_name", "types"), class = "data.frame", row.names = c(NA, 
    7L))), formatted_address = c("óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 127299", 
"óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 125008"), geometry = structure(list(
    location = structure(list(lat = c(55.8169112, 55.826859), 
        lng = c(37.5202899, 37.529427)), .Names = c("lat", "lng"
    ), class = "data.frame", row.names = 1:2), location_type = c("ROOFTOP", 
    "ROOFTOP"), viewport = structure(list(northeast = structure(list(
        lat = c(55.8182601802915, 55.8282079802915), lng = c(37.5216388802915, 
        37.5307759802915)), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2), 
        southwest = structure(list(lat = c(55.8155622197085, 
        55.8255100197085), lng = c(37.5189409197085, 37.5280780197085
        )), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2)), .Names = c("northeast", 
    "southwest"), class = "data.frame", row.names = 1:2)), .Names = c("location", 
"location_type", "viewport"), class = "data.frame", row.names = 1:2), 
    partial_match = c(TRUE, TRUE), place_id = c("ChIJnVMw7C1ItUYRdfeWEQrXuAk", 
    "ChIJnbnwOdY3tUYR1_D9pHTqCsI"), types = list("street_address", 
        "street_address")), .Names = c("address_components", 
"formatted_address", "geometry", "partial_match", "place_id", 
"types"), class = "data.frame", row.names = 1:2), status = "OK"), .Names = c("results", 
"status"))

In the results element of the second list, every field holds one entry per candidate address (the results dataframe has two rows instead of one). When flattened, this creates an "extra" observation for that address, making it impossible to cbind() the geocoding results back to the list of addresses. I am using the following functions to flatten my nested lists to dataframes. How can I modify them to take only the first address when this additional nesting occurs? If that address is incorrect, the building will simply be dropped from the sample when I later merge with another dataframe, so my only concern is that each geocoded observation matches the appropriate row in the original dataframe (the source of the addresses).

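# Flatten one googleway results dataframe and keep only the columns of interest
flatten_googleway <- function(df) {
  res <- jsonlite::flatten(df)  # namespaced call, so require(jsonlite) is unnecessary
  res[, names(res) %in% c("geometry.location_type", "geometry.location.lat", 
                          "geometry.location.lng", "formatted_address")]
}
# Build one dataframe from all results; entries with no result get an all-NA row
# (template_res[1, ] on the zero-row template below yields a row of NAs)
moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
  if (length(x$results) == 0) template_res[1, ] else flatten_googleway(x$results)
}))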

## Template for NA results; assign it to template_res before running the do.call() above
template_res <- structure(list(formatted_address = character(0), geometry.location_type = character(0), 
    geometry.location.lat = numeric(0), geometry.location.lng = numeric(0)), .Names = c("formatted_address", 
"geometry.location_type", "geometry.location.lat", "geometry.location.lng"
), row.names = integer(0), class = "data.frame")

There is 1 answer

Sean Norton On BEST ANSWER

Whoops, I was massively over-complicating things, as usual. I was able to fix this simply by modifying the lapply() call so that the NA template row is used both for list elements with no results and for elements where more than one possible result is returned, i.e. where x$results$formatted_address (or, equivalently, x$results$address_components) has length greater than 1.

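moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
  # Use the NA row when there is no result or more than one candidate;
  # || is the scalar, short-circuiting "or" intended for if() conditions
  if (length(x$results) == 0 || length(x$results$formatted_address) > 1) template_res[1, ] else flatten_googleway(x$results)
}))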

Unfortunately I still lose some data this way, but identifying which of the returned addresses is correct would likely be too time-consuming, and a bit silly in a dataset with so many observations.
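
For readers who would rather keep the first candidate than discard ambiguous addresses (which is what the question originally asked for), the following is a minimal base-R sketch, not the approach used above. The helpers make_results() and first_match() and the toy list are hypothetical names invented for illustration; the toy data only mimics the nested results layout shown in the question, and the sketch assumes the first candidate is an acceptable choice.

```r
# Hypothetical toy builder: mimics the nested layout of x$results from the
# question (a dataframe whose geometry column is itself a dataframe).
make_results <- function(addr, lat, lng) {
  n <- length(addr)
  geometry <- data.frame(location_type = rep("ROOFTOP", n),
                         stringsAsFactors = FALSE)
  geometry$location <- data.frame(lat = lat, lng = lng)
  res <- data.frame(formatted_address = addr, stringsAsFactors = FALSE)
  res$geometry <- geometry
  res
}

# Hypothetical helper: always returns exactly one row per geocoded address.
# No result -> all-NA row; one or more results -> first candidate only.
first_match <- function(x) {
  res <- x$results
  if (NROW(res) == 0) {
    return(data.frame(formatted_address = NA_character_,
                      geometry.location_type = NA_character_,
                      geometry.location.lat = NA_real_,
                      geometry.location.lng = NA_real_,
                      stringsAsFactors = FALSE))
  }
  data.frame(formatted_address = res$formatted_address[1],
             geometry.location_type = res$geometry$location_type[1],
             geometry.location.lat = res$geometry$location$lat[1],
             geometry.location.lng = res$geometry$location$lng[1],
             stringsAsFactors = FALSE)
}

# Toy input: one match, two candidate matches, and no match at all.
toy <- list(
  list(results = make_results("A St, 4", 55.81, 37.52), status = "OK"),
  list(results = make_results(c("A St, 23 (zip 1)", "A St, 23 (zip 2)"),
                              c(55.82, 55.83), c(37.52, 37.53)),
       status = "OK"),
  list(results = list(), status = "ZERO_RESULTS")
)

# Number of candidates per input address, useful for seeing how many
# addresses are ambiguous before deciding how to treat them.
n_candidates <- vapply(toy, function(x) NROW(x$results), integer(1))

# One row per input element, so the result can be cbind()-ed to the source.
geo <- do.call(rbind, lapply(toy, first_match))
stopifnot(nrow(geo) == length(toy))
```

Because first_match() indexes the nested columns directly instead of flattening the whole dataframe, it does not depend on jsonlite; on real data the same idea could probably be applied by taking the first row of flatten_googleway(x$results), though that variant is untested here.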