Why do I get different results using SE or NSE dplyr functions?


Asked by Julien Navarre on 13 August 2015 at 14:53

Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.

Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2, then keep the rows with max(datetime) in each group.

library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}

set.seed(42)
# Create 230000 id pairs
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"), 
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly repeat rows from ids[1:2000, ] to create groups
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))

With the NSE version I get 230000 rows:

df1 <- 
  datas %>% 
  group_by(id1, id2) %>% 
  filter(datetime == max(datetime))
nrow(df1) #230000

But with the SE version I get only 229977 rows:

ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <- 
  datas %>% 
  group_by_(ids) %>% 
  filter_(.dots = lazyeval::interp(~var == fun(var), 
                                   var = as.name(filterVar), 
                                   fun = as.name(filterFun)))
nrow(df2) #229977

My two pieces of code are equivalent, right? Why do I get different results? Thanks.


**aosmith** · Accepted Answer · 2015-08-13T15:32:11+00:00

You'll need to specify the .dots argument in group_by_ when giving a vector of column names.

df2 <- datas %>% 
    group_by_(.dots = ids) %>% 
    filter_(.dots = lazyeval::interp(~var == fun(var), 
                               var = as.name(filterVar), 
                               fun = as.name(filterFun)))
nrow(df2)
[1] 230000

It looks like group_by_ might take the first column name from the vector as the only grouping variable when you don't specify the .dots argument. You can check this by grouping on id1 only.

df1 <- datas %>% 
    group_by(id1) %>% 
    filter(datetime == max(datetime))
nrow(df1)
[1] 229977

(If you group just on id2, the number of rows is 229976.)
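For readers on a newer dplyr, here is a minimal sketch of the same idea on a small made-up table. It is not the code from the question: it assumes dplyr >= 1.0.0 and swaps group_by_ / filter_ with lazyeval for group_by(across(all_of(...))) and the .data pronoun. The toy data and the names toy, by_both, res_both and res_id1 are hypothetical. It shows that the filter returns one row per group, so a lower row count points to coarser grouping, and that group_vars() reports which columns are actually used.

```r
library(dplyr)

# Hypothetical toy data: id1 "a" is shared by two distinct (id1, id2) pairs
toy <- tibble(
  id1 = c("a", "a", "a", "b"),
  id2 = c("x", "x", "y", "z"),
  datetime = as.POSIXct("2015-01-01", tz = "UTC") + c(10, 20, 30, 40)
)

ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"

# Look up the summary function by name before entering the pipeline
fun <- match.fun(filterFun)

# Group by every column named in `ids` (assumes dplyr >= 1.0.0)
by_both <- toy %>% group_by(across(all_of(ids)))
group_vars(by_both)  # "id1" "id2": the grouping columns actually in effect

# Keep the rows where the chosen column equals its per-group summary
res_both <- by_both %>%
  filter(.data[[filterVar]] == fun(.data[[filterVar]]))

# Grouping on id1 alone, as in the check above
res_id1 <- toy %>%
  group_by(id1) %>%
  filter(datetime == max(datetime))

nrow(res_both)  # 3: one row per (id1, id2) pair
nrow(res_id1)   # 2: one row per id1 value
```

Calling group_vars() on the grouped table before filtering is a quick way to spot this kind of mismatch without comparing row counts.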
