Can I access results of "setup_data" from "map_data" in ggpplot2 ggproto?

(works fine for "compute_layout" but not "map_data")

Hi folks. I'm working on a ggplot2 extension that will implement a new faceting method.

I don't want to get into the nitty gritty of the algorithm, but suffice it to say that I need to first compute some new columns for each row of the input data, and only then can I perform a compute_layout and map_data.

Of course, one option is to compute my new columns twice, once inside of compute_layout and once again inside of map_data, but this will be twice as expensive computationally, and just less elegant.

It seems that setup_params and setup_data are meant for this exact use case.

What Doesn't Work ❌

I'm creating a little reproducible example based off this great vignette.

I've just made a small modification that tries to add a hello column to the data using the setup_data function.

library(ggplot2)
facet_bootstrap <- function(n = 9, prop = 0.2, nrow = NULL, ncol = NULL, 
                            scales = "fixed", shrink = TRUE, strip.position = "top") {
  facet <- facet_wrap(~.bootstrap, nrow = nrow, ncol = ncol, scales = scales, 
                      shrink = shrink, strip.position = strip.position)
  facet$params$n <- n
  facet$params$prop <- prop
  ggproto(NULL, FacetBootstrap,
          shrink = shrink,
          params = facet$params
  )
}
FacetBootstrap <- ggproto("FacetBootstrap", FacetWrap,
                          setup_data = function(data, params){
                            data[[1]]$hello <- 'world'
                            print("In SETUP_DATA:")
                            print("   names(data):")
                            print(names(data[[1]]))
                            print("")
                            data
                          },
                          compute_layout = function(data, params) {
                            id <- seq_len(params$n)
                            print("In COMPUTE_LAYOUT:")
                            print("   names(data):")
                            print(names(data[[1]]))
                            print("")
                            dims <- wrap_dims(params$n, params$nrow, params$ncol)
                            layout <- data.frame(PANEL = factor(id))
                            if (params$as.table) {
                              layout$ROW <- 1+as.integer((id - 1L) %/% dims[2] + 1L)
                            } else {
                              layout$ROW <- 1+as.integer(dims[1] - (id - 1L) %/% dims[2])
                            }
                            layout$COL <- 2+as.integer((id - 1L) %% dims[2] + 1L)
                            layout <- layout[order(layout$PANEL), , drop = FALSE]
                            rownames(layout) <- NULL
                            # Add scale identification
                            layout$SCALE_X <- if (params$free$x) id else 1L
                            layout$SCALE_Y <- if (params$free$y) id else 1L
                            cbind(layout, .bootstrap = id)
                          },
                          map_data = function(data, layout, params) {
                            print("In MAP_DATA:")
                            print("   names(data):")
                            print(names(data))
                            print("")
                            if (is.null(data) || nrow(data) == 0) {
                              return(cbind(data, PANEL = integer(0)))
                            }
                            n_samples <- round(nrow(data) * params$prop)
                            new_data <- lapply(seq_len(params$n), function(i) {
                              cbind(data[sample(nrow(data), n_samples), , drop = FALSE], PANEL = i)
                            })
                            do.call(rbind, new_data)
                          }
)
ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 0.1) + 
  facet_bootstrap(n = 9, prop = 0.05)

with outputs:

[1] "In SETUP_DATA:"
[1] "   names(data):"
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"  
 [7] "price"   "x"       "y"       "z"       "hello"  
[1] ""
[1] "In COMPUTE_LAYOUT:"
[1] "   names(data):"
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"  
 [7] "price"   "x"       "y"       "z"       "hello"  
[1] ""
[1] "In MAP_DATA:"
[1] "   names(data):"
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"  
 [7] "price"   "x"       "y"       "z"      
[1] ""

notice how my hello column is available in compute_layout but not map_data

What DOES Work ✅

As a workaround hack, I CAN create some columns and pass them through as parameters using setup_params. This is a bit gross because they're not "parameters" conceptually, they're data. But if all else fails - I will take this approach

library(ggplot2)
facet_bootstrap <- function(n = 9, prop = 0.2, nrow = NULL, ncol = NULL, 
                            scales = "fixed", shrink = TRUE, strip.position = "top") {
  facet <- facet_wrap(~.bootstrap, nrow = nrow, ncol = ncol, scales = scales, 
                      shrink = shrink, strip.position = strip.position)
  facet$params$n <- n
  facet$params$prop <- prop
  ggproto(NULL, FacetBootstrap,
          shrink = shrink,
          params = facet$params
  )
}
FacetBootstrap <- ggproto("FacetBootstrap", FacetWrap,
                          setup_params = function(data, params){
                            params$hello <- 'world'
                            print("In SETUP_DATA:")
                            print("   params$hello:")
                            print(params$hello)
                            print("")
                            params
                          },
                          compute_layout = function(data, params) {
                            id <- seq_len(params$n)
                            print("In COMPUTE_LAYOUT:")
                            print("   params$hello:")
                            print(params$hello)
                            print("")
                            dims <- wrap_dims(params$n, params$nrow, params$ncol)
                            layout <- data.frame(PANEL = factor(id))
                            if (params$as.table) {
                              layout$ROW <- 1+as.integer((id - 1L) %/% dims[2] + 1L)
                            } else {
                              layout$ROW <- 1+as.integer(dims[1] - (id - 1L) %/% dims[2])
                            }
                            layout$COL <- 2+as.integer((id - 1L) %% dims[2] + 1L)
                            layout <- layout[order(layout$PANEL), , drop = FALSE]
                            rownames(layout) <- NULL
                            # Add scale identification
                            layout$SCALE_X <- if (params$free$x) id else 1L
                            layout$SCALE_Y <- if (params$free$y) id else 1L
                            cbind(layout, .bootstrap = id)
                          },
                          map_data = function(data, layout, params) {
                            print("In MAP_DATA:")
                            print("   params$hello:")
                            print(params$hello)
                            print("")
                            if (is.null(data) || nrow(data) == 0) {
                              return(cbind(data, PANEL = integer(0)))
                            }
                            n_samples <- round(nrow(data) * params$prop)
                            new_data <- lapply(seq_len(params$n), function(i) {
                              cbind(data[sample(nrow(data), n_samples), , drop = FALSE], PANEL = i)
                            })
                            do.call(rbind, new_data)
                          }
)
ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 0.1) + 
  facet_bootstrap(n = 9, prop = 0.05)

with the following output

[1] "In SETUP_DATA:"
[1] "   params$hello:"
[1] "world"
[1] ""
[1] "In COMPUTE_LAYOUT:"
[1] "   params$hello:"
[1] "world"
[1] ""
[1] "In MAP_DATA:"
[1] "   params$hello:"
[1] "world"
[1] ""

Summary of the results

  • When I modify data from "setup_data" i CAN access it in "compute_layout" ✅
  • When I modify data from "setup_data" i CANNOT access it in "map_data" ❌
  • When I modify data from "setup_params" i CAN access it in "compute_layout" ✅
  • When I modify data from "setup_params" i CAN access it in "map_data" ✅

Final Question(s)

  • Are the above results expected, or am I just thinking about things incorrectly?
  • what would be the ideal way to compute new columns for data ONCE, and then make them available for both map_data AND compute_layout for a Facet ggproto?

Thanks in advance!

1

There are 1 answers

2
teunbrand On BEST ANSWER

TL;DR: set a new column in every list-element of data in the setup_data function.

It seems that setup_params and setup_data are meant for this exact use case.

That's right, but I get the impression from your question that some confusion exists about the order of operations of data ingestion. Facets and coordinates are part of the 'layout' of a plot. Before the layout is setup, layers setup their data (sometimes making a copy of the global data). Then, the layout can inspect the data and make adjustments (typically appending a PANEL column). If we inspect/print to console ggplot2:::Layout$setup, we see the following (comments by me):

<ggproto method>
  <Wrapper function>
    function (...) 
f(..., self = self)

  <Inner function (f)>
    function (self, data, plot_data = new_data_frame(), plot_env = emptyenv()) 
{
    data <- c(list(plot_data), data)

    # First `setup_params` is used
    self$facet_params <- self$facet$setup_params(data, self$facet$params)
    self$facet_params$plot_env <- plot_env

    # Second, `setup_data` is used
    data <- self$facet$setup_data(data, self$facet_params)
    self$coord_params <- self$coord$setup_params(data)
    data <- self$coord$setup_data(data, self$coord_params)
    
    # Third, `compute_layout` is used.
    self$layout <- self$facet$compute_layout(data, self$facet_params)
    self$layout <- self$coord$setup_layout(self$layout, self$coord_params)
    check_layout(self$layout)
    
    # Lastly, `map_data` is used for every data *except* the global data!
    lapply(data[-1], self$facet$map_data, layout = self$layout, 
        params = self$facet_params)
}

So from this we learn that the order of operations is setup_params --> setup_data --> compute_layout --> map_data. Note that map_data starts with lapply(data[-1], ...) wherein data is a list with data.frames with the global data in position 1 and layer data thereafter.

Your setup_data method only applies data[[1]]$hello <- 'world' to the global data and not to layer data. Replacing that line with data <- lapply(data, cbind, hello = "world") applies it to global data and layer data. At this point, every layer already has it's own (copy of the global) data, so from an efficiency standpoint, there is not a lot the facets can do to efficiently append a column to the global data that layers can 'inherit'.

To be more explicit, this is what I'm proposing:

FacetBootstrap <- ggproto(
  "FacetBootstrap", FacetWrap,
  setup_data = function(data, params){
    data <- lapply(data, cbind, hello = "world")
    print("In SETUP_DATA:")
    print("   names(data):")
    print(names(data[[1]]))
    print("")
    data
  },
  ...other code...
)