Using Aggregate and Cbind- NA issue?

957 views Asked by At

I am using cbind to find the mean of 3 different columns. However I get different answers for the means when I do:

  DFNEW <- aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean)

vs

  DFNEW <- aggregate(cbind(X1, X2)~Y, DF, FUN=mean)

The means of X1 and X2 are different when I run command 1 and when I run command 2. X1, X2, and X3 all have different numbers of NA arguments- is that the reason? Part of this may also be that I'm not entirely sure what cbind is doing in this case.

3

There are 3 answers

2
akrun On BEST ANSWER

I guess the reason why you are getting different result is because by default, na.action=na.omit for the formula interface. So, the rows with NAs are omitted and not used for the calculation of mean. When we use different combinations of columns, different rows could be deleted based on the occurrence of NA. By specifying na.action=NULL, the rows will not get deleted and we can remove the NA values while calculating the mean by using the argument na.rm=TRUE in the mean function.

  aggregate(cbind(X1, X2)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
   #Y       X1       X2
   #1 A 3.625000 6.375000
   #2 B 3.000000 6.750000
   #3 C 4.142857 7.166667

  aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
  #  Y       X1       X2       X3
  #1 A 3.625000 6.375000 2.875000
  #2 B 3.000000 6.750000 1.500000
  #3 C 4.142857 7.166667 1.333333

The results we get from the above will be the same below i.e. without using the formula interface

aggregate(DF[,c("X1", "X2")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
#  Y       X1       X2
# 1 A 3.625000 6.375000
# 2 B 3.000000 6.750000
# 3 C 4.142857 7.166667


aggregate(DF[,c("X1", "X2", "X3")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
#  Y       X1       X2       X3
#1 A 3.625000 6.375000 2.875000
#2 B 3.000000 6.750000 1.500000
#3 C 4.142857 7.166667 1.333333

If you want some alternatives, you could use dplyr

 library(dplyr)
  DF %>% 
     group_by(Y) %>% 
     summarise_each(funs(mean=mean(., na.rm=TRUE)))
  # Source: local data frame [3 x 4]

  #  Y       X1       X2       X3
  #1 A 3.625000 6.375000 2.875000
  #2 B 3.000000 6.750000 1.500000
  #3 C 4.142857 7.166667 1.333333

   DF %>% 
      group_by(Y) %>%
      summarise_each(funs(mean=mean(., na.rm=TRUE)), X1, X2)
   #Source: local data frame [3 x 3]

   #  Y       X1       X2
   #1 A 3.625000 6.375000
   #2 B 3.000000 6.750000
   #3 C 4.142857 7.166667

data

set.seed(42)
DF <- data.frame(X1=sample(c(NA, 1:5), 25, replace=TRUE), 
    X2= sample(c(NA, 1:10), 25, replace=TRUE), X3= sample(c(NA,0:5), 25, 
    replace=TRUE), Y=sample(LETTERS[1:3], 25, replace=TRUE))
0
rnso On

One can also use split command:

> sapply(split(DF, DF$Y), function(x) sapply(x[1:3], mean, na.rm=T))
       A    B        C
X1 3.625 3.00 4.142857
X2 6.375 6.75 7.166667
X3 2.875 1.50 1.333333

(Using @akrun's data)

0
KFB On

The beauty of open source is diversity and array of tools available. Quite a fun. Here is a data.table alternative for your reference. Use @akrun's data.

library(data.table)
setDT(DF)[, lapply(.SD, mean, na.rm=T), by=Y]
#    Y       X1       X2       X3
# 1: C 4.142857 7.166667 1.333333
# 2: A 3.625000 6.375000 2.875000
# 3: B 3.000000 6.750000 1.500000

setDT(DF)[, lapply(.SD, mean, na.rm=T), by=Y, .SDcols=1:2]
#   Y       X1       X2
# 1: C 4.142857 7.166667
# 2: A 3.625000 6.375000
# 3: B 3.000000 6.750000