Data Cleaning for Survival Analysis Using a Participant's Own Data to Impute Values

I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that missing data gets imputed based on the surrounding values within a given subject. I'd like to use the mean of the closest previous and closest subsequent values for the participant. If there is no subsequent value present, then I'd like to use the previous value carried forward until a subsequent value is present.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with forces me to use conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I’m at a bit of a loss as to how to do this. I would appreciate some guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking for a solution.

The details are below:

# Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

In the targets below, the values that replace an NA are the changes from the dataset above.

The goal is to impute the NA values in ss so that each ID ends up as follows:

ID #1: 2,2,4,3,1.5,0,0

ID #2: 1,4,0,0,0,0,0

ID #3: 4,2,1,3,3,2,NA (no change, because the row with NA will eventually be deleted)

ID #4: 3,4,3,3,1.5,0,0 (this one requires multiple changes, and I expect it is the most challenging to tackle)
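
To make these targets easy to check against any candidate solution, they can be written out as a single vector in the same row order as mydat. This is only a convenience sketch: the names ss_target and matches_target are illustrative and not part of the dataset above.

```r
# Hypothetical helper vector: desired ss values, in the same row order as mydat
ss_target <- c(2, 2, 4, 3, 1.5, 0, 0,    # ID 1
               1, 4, 0, 0, 0, 0, 0,      # ID 2
               4, 2, 1, 3, 3, 2, NA,     # ID 3 (trailing NA left as is)
               3, 4, 3, 3, 1.5, 0, 0)    # ID 4

# Hypothetical checker: TRUE if a candidate vector matches the targets.
# all.equal() requires NA to appear in the same positions in both vectors.
matches_target <- function(candidate) {
  isTRUE(all.equal(candidate, ss_target))
}
```

A candidate imputed column can then be passed to matches_target() to see whether it reproduces all four IDs at once.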

There is 1 answer

Answer by lukeA (best answer)

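If processing speed is not an issue (I guess "ID #4" makes it hard to vectorize the imputation, because the second NA depends on the value imputed for the first), then maybe try a loop that fills each NA from its immediate neighbours: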

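f <- function(x) {
  idx <- which(is.na(x))        # positions of the missing values
  for (i in idx) {              # left to right, so earlier fills feed later ones
    sel <- x[i + c(-1, 1)]      # immediate previous and next values
    if (i < length(x))          # a trailing NA keeps its NA neighbour and stays NA
      sel <- sel[!is.na(sel)]   # otherwise drop an NA neighbour before averaging
    x[i] <- mean(sel)
  }
  x
}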
cbind(mydat, ss_imp=ave(mydat$ss, mydat$id, FUN=f))
#    id time ss ss_imp
# 11  1    0  2    2.0
# 12  1    1  2    2.0
# 13  1    2  4    4.0
# 14  1    3  3    3.0
# 15  1    4 NA    1.5
# 16  1    5  0    0.0
# 17  1    6  0    0.0
# 21  2    0  1    1.0
# 22  2    1  4    4.0
# 23  2    2  0    0.0
# 24  2    3 NA    0.0
# 25  2    4  0    0.0
# 26  2    5  0    0.0
# 27  2    6  0    0.0
# 31  3    0  4    4.0
# 32  3    1  2    2.0
# 33  3    2  1    1.0
# 34  3    3  3    3.0
# 35  3    4  3    3.0
# 36  3    5  2    2.0
# 37  3    6 NA     NA
# 41  4    0  3    3.0
# 42  4    1  4    4.0
# 43  4    2  3    3.0
# 44  4    3 NA    3.0
# 45  4    4 NA    1.5
# 46  4    5  0    0.0
# 47  4    6  0    0.0
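
Two edge cases are worth keeping in mind with a neighbour-based loop like this: a leading NA has no previous value, and the result depends on the rows being ordered by time within each id. The following is a minimal sketch that makes both points explicit and then drops the rows that are still NA, as the question says will happen eventually. The name impute_neighbours and the decision to leave leading and trailing NA values untouched are assumptions made for illustration, not part of the function above.

```r
# Rebuild the example data from the question
mydat <- data.frame(
  id   = rep(1:4, each = 7),
  time = rep(0:6, times = 4),
  ss   = c(2, 2, 4, 3, NA, 0, 0,
           1, 4, 0, NA, 0, 0, 0,
           4, 2, 1, 3, 3, 2, NA,
           3, 4, 3, NA, NA, 0, 0)
)

# Hypothetical variant: fill each NA from its immediate neighbours,
# working left to right so earlier fills can feed later ones.
impute_neighbours <- function(x) {
  n <- length(x)
  for (i in which(is.na(x))) {
    if (i == 1 || i == n) next   # assumption: leave leading/trailing NA alone
    prev <- x[i - 1]
    nxt  <- x[i + 1]
    if (is.na(prev)) next        # nothing to carry forward (follows a leading NA)
    # mean of both neighbours, or the previous value alone if the next is NA
    x[i] <- if (is.na(nxt)) prev else (prev + nxt) / 2
  }
  x
}

# Sort by id and time first, since the function relies on row order
mydat <- mydat[order(mydat$id, mydat$time), ]
mydat$ss_imp <- ave(mydat$ss, mydat$id, FUN = impute_neighbours)

# Drop rows that could not be imputed (here only the last row of id 3)
mydat_complete <- mydat[!is.na(mydat$ss_imp), ]
```

On the example data this gives the same ss_imp column as the table above. The two versions differ only for a leading NA, which this sketch leaves as NA rather than filling from the next value.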