Replacing missing data with the mean of a subgroup in R


I have a table with some missing data that I'd like to replace with the mean of other, related data, based on certain conditions. Here is some toy data that shows the problem:

var1    var2    var3
123.1    2.1    113
166.5    2.1    113
200.3    2.1    112
 NA      2.1    113
 NA      2.1     NA 
212.1    3.3    112   
 ...     ...    ... 

What I'd like to do is fill in the NA values for var1 with the mean of var1 over the rows that have the same var2 and var3.

I.e., the first NA in the var1 column matches the 1st and 2nd entries on both var2 and var3, so it would get the value (123.1 + 166.5) / 2.

The second NA in the var1 column is missing the var3 information so would be given the mean of all the other var1 values where var2 = 2.1.
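
To make the target values concrete, the two fills described above can be checked by hand. This is only a sketch of the arithmetic on the toy rows shown; the names first_fill and second_fill are made up for illustration.

```r
# Observed var1 values that share var2 == 2.1 and var3 == 113 with the first NA row.
first_fill <- mean(c(123.1, 166.5))

# Observed var1 values with var2 == 2.1, used for the row whose var3 is missing.
second_fill <- mean(c(123.1, 166.5, 200.3))

print(c(first_fill, second_fill))
```

So the two missing entries should come out as roughly 144.8 and 163.3.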

I'm relatively new to R and I can't seem to get the conditional logic correct - thanks in advance!


There are 2 answers

AdamO

What I'd like to do is be able to fill in the NA values for var1 with the mean of var2 in the case where they both have the same var3.

Hmm... I don't think that's what you want, but to do that:

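# Mean of var2 within each var3 group (assumes var1, var2, var3 are vectors in scope)
means <- tapply(var2, var3, mean, na.rm = TRUE)
miss <- is.na(var1)
var1[miss] <- means[as.character(var3[miss])]  # look up each row's group mean by name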
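
If the goal is instead what the worked example in the question describes (the mean of var1 among rows sharing var2 and var3, falling back to var2 alone when var3 is missing), a minimal base R sketch could look like this. The data frame DF, the helper fill_var1 and the column var1_filled are illustrative names, and the sketch assumes a missing value is simply left as NA when no matching observed value exists.

```r
# Toy data from the question (the six rows shown).
DF <- data.frame(
  var1 = c(123.1, 166.5, 200.3, NA, NA, 212.1),
  var2 = c(2.1, 2.1, 2.1, 2.1, 2.1, 3.3),
  var3 = c(113, 113, 112, 113, NA, 112)
)

# Hypothetical helper: fill each missing var1 from matching rows.
fill_var1 <- function(df) {
  filled <- df$var1
  for (i in which(is.na(df$var1))) {
    # Rows sharing var2 with row i (which() drops NA comparisons).
    pool <- which(df$var2 == df$var2[i])
    # If var3 is known for row i, narrow the pool to rows that also share var3.
    if (!is.na(df$var3[i])) {
      pool <- intersect(pool, which(df$var3 == df$var3[i]))
    }
    donors <- df$var1[pool]
    donors <- donors[!is.na(donors)]
    # Assumption: keep the NA when there is no observed value to average.
    if (length(donors) > 0) filled[i] <- mean(donors)
  }
  filled
}

DF$var1_filled <- fill_var1(DF)
print(DF)
```

On the toy data this fills row 4 from rows 1 and 2, and row 5 from rows 1 to 3. Note that it compares var2 and var3 with ==, which is fine for values read in identically but can be fragile for computed floating-point keys.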

Roland

You should do multiple imputation instead:

DF <- read.table(text = "var1    var2    var3
123.1    2.1    113
166.5    2.1    113
200.3    2.1    112
 NA      2.1    113
 NA      2.1     NA 
212.1    3.3    112 ", header = TRUE)

library(mice)
impDF <- mice(DF)
plot(impDF)
comDF <- complete(impDF, "long")

library(data.table)
setDT(comDF)
comDF[, setNames(c(lapply(.SD, mean), lapply(.SD, sd)), 
                 c(paste0("var", 1:3), paste0("SDvar", 1:3))), 
      by = .id, .SDcols = paste0("var", 1:3)]
#   .id   var1 var2  var3   SDvar1 SDvar2    SDvar3
#1:   1 123.10  2.1 113.0  0.00000      0 0.0000000
#2:   2 166.50  2.1 113.0  0.00000      0 0.0000000
#3:   3 200.30  2.1 112.0  0.00000      0 0.0000000
#4:   4 173.70  2.1 113.0 34.80718      0 0.0000000
#5:   5 174.14  2.1 112.2 46.84131      0 0.4472136
#6:   6 212.10  3.3 112.0  0.00000      0 0.0000000

Obviously, this works better with more data, and other imputation methods might yield better results. Study the mice documentation.