dot stored in call to update formula leads to scoping issue

I am relying on the compareGroups package to do some comparisons at the end of a pipe chain. When subsetting the final results, the call to [ triggers a call to update (both in their compareGroups-specific versions), which leads to a scoping problem.

Try this:

library(tidyverse)
# install.packages("compareGroups")
library(compareGroups)

get_data <- function() return(mtcars)

assign_group <- function(df) {
  n <- nrow(df)
  df$group <- rbinom(n, 1, 0.5)
  return(df)
}

get_results <- function(){
  get_data() %>% assign_group %>% compareGroups(group ~ ., data = .)
}

res <- get_results()
# all the above works, but the following triggers the error:
res["mpg"]

This leads to the following error:

Error in compareGroups(formula = group ~ mpg, data = .) : object '.' not found

The relevant (abbreviated) traceback is this:

compareGroups(formula = group ~ mpg, data = .) 
eval(call, parent.frame()) 
update.compareGroups(x, formula = group ~ mpg) 
update(x, formula = group ~ mpg) at <text>#1
eval(parse(text = cmd)) 
`[.compareGroups`(res, "mpg") 
res["mpg"] 

So, my understanding is that the dot notation in the dplyr pipe chain prevents the update call from finding the dataframe, which is stored as . in the call. The error makes sense, as . is neither the name of the dataframe nor available outside the scope of the function get_results (though the main issue is the .). One obvious way of avoiding this error is to fix the update.compareGroups function: I don't think we need another call to the package to redo all calculations when I simply want to retrieve individual results (which have already been calculated).
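The mechanism can be reproduced without compareGroups. The following is a minimal sketch, assuming base R's lm() and update() as stand-ins for the compareGroups methods, with a local variable named . playing the role of the pipe placeholder (fit_in_function is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: binds the data to a local variable named `.`,
# mimicking how the pipe makes the placeholder available inside a function.
fit_in_function <- function() {
  . <- mtcars
  lm(mpg ~ wt, data = .)
}

fit <- fit_in_function()

# The stored call keeps the symbol `.`, not the data it referred to.
stored_data_arg <- fit$call$data
print(stored_data_arg)

# update() re-evaluates the stored call in the caller's frame,
# where no object named `.` exists, so the refit fails.
refit <- try(update(fit, . ~ . + hp), silent = TRUE)
update_failed <- inherits(refit, "try-error")
print(update_failed)
```

This suggests the problem lies in any method that re-evaluates a stored call, rather than in compareGroups alone.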

However, this is a more general issue with the . notation of dplyr and the fact it is stored in the call. This problem seems general enough so that I would imagine someone has encountered it before, and has found a more general solution?

There is 1 answer

Peter Smittenaar

Firstly, I don't think piping your data into compareGroups makes sense. Remember that piping by default inserts the dataframe as the first argument to compareGroups() (magrittr only skips this when . is passed directly as an argument, as in your data = .), even though the function specification is:

compareGroups(formula, data, ...)

Secondly, this dplyr vignette shows you can use .data instead of just . to access the piped data. However, in this case the following fails with the message "data argument will be ignored since formula is already a data set" (because the data is piped into the first argument).

get_results <- function(){
  get_data() %>% assign_group %>% compareGroups(group ~ ., data = .data)  # does NOT work
}

Making a separate call to compareGroups without piping then gets me into an unholy mess of environments whereby res does not have access to the data when requesting res['mpg'] outside the function get_results(), as you already alluded to with the scoping problem. I think this is a compareGroups problem, because if I use the same architecture with glm there's no such problem. So the best I can do is to take the dataframe out of the function environment, which I think doesn't properly answer your question:

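library(tidyverse)
library(compareGroups)

get_data <- function() return(mtcars)

assign_group <- function(df) {
    n <- nrow(df)
    df$group <- rbinom(n, 1, 0.5)
    return(df)
}

# Build the dataframe at top level so the stored call refers to a name
# that is still visible when res['mpg'] re-runs compareGroups
df <- get_data() %>% assign_group()
res <- compareGroups(group ~ ., data = df)
print(res['mpg'])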

But I hope the first two points I made get you closer to an answer.
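For fitted objects that expose their call, one possible more general workaround is to rewrite the stored data argument so it points at a name that is visible where update() will run. This is only a sketch using base R's lm(); fit_in_function and model_data are hypothetical names, and whether a compareGroups object can be patched the same way is an untested assumption:

```r
# Hypothetical helper reproducing the situation: the call stores `data = .`
fit_in_function <- function() {
  . <- mtcars
  lm(mpg ~ wt, data = .)
}

fit <- fit_in_function()

# Keep the data under a name visible where update() will be called...
model_data <- mtcars
# ...and point the stored call at that name instead of `.`
fit$call$data <- quote(model_data)

# update() can now re-evaluate the call from the top level
refit <- update(fit, . ~ . + hp)
coef_names <- names(coef(refit))
print(coef_names)
```

The trade-off is the same as in the answer's code: the data has to live under a stable name in the environment from which the object is later updated.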