How to impute missing values that are conditional on other values in the data set in R with MICE

I have a dataset consisting of 2 continuous variables X1, X2 with missing values in both, and I need to impute the missing data. I am working with the MICE package in R. The trouble is that the values in one column are conditional on the other, specifically X1 >= X2. However, when I run mice, values are imputed that violate this condition.

Here is a minimal working example:

library(MASS)
library(tidyverse)
library(mice)

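set.seed(1)  # arbitrary seed so the example is reproducible

p1 <- 0.7   # probability that X1 is observed
p2 <- 0.65  # probability that X2 is observed

sample_size <- 100
sample_meanvector <- c(5, 5)
# symmetric covariance matrix (off-diagonal entries must match)
sample_covariance_matrix <- matrix(c(10, 5, 5, 9), ncol = 2)
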
mvrnorm(
        n = sample_size,
        mu = sample_meanvector, 
        Sigma = sample_covariance_matrix) %>%
    data.frame() %>%
    as_tibble() %>%
    mutate(R1 = rbinom(sample_size, 1, p1)) %>%
    mutate(R2 = rbinom(sample_size, 1, p2)) %>%
    mutate(X1 = ifelse(R1 == 1, X1, NA)) %>%
    mutate(X2 = ifelse(R2 == 1, X2, NA)) %>%
    dplyr::select(X1, X2) %>%
    filter(X1 >= X2 | is.na(X1) | is.na(X2)) -> sample_data

sample_data %>% 
    ggplot(aes(x=X1,y=X2)) + 
        geom_point() + 
        geom_abline(slope = 1, intercept = 0, color = 'red')

[Figure: scatter plot of the unimputed data]

mice(sample_data, m=1) -> mids

complete(mids, 1) -> imputed_data

imputed_data %>%
    ggplot(aes(x=X1,y=X2)) + 
        geom_point() + 
        geom_abline(slope = 1, intercept = 0, color = 'red')

[Figure: scatter plot of the imputed data]

I understand that I need to use the post argument of mice() somehow, but I cannot find sufficiently detailed documentation on it, in particular for the case where imputed values are constrained by other imputed values in the same dataset. Please help.

There is 1 answer

Answer by hanne

The easiest solution to your problem is to use a different R package: smcfcs. For example:

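library(smcfcs)
library(mice)  # provides pool(), used in the last line
data <- pop    # `pop` is assumed to be a complete data frame defined beforehand; it is not created in this snippet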
data[sample(nrow(data), size = 100), "wgt"] <- NA
data[sample(nrow(data), size = 100), "hgt"] <- NA
data$whr <- 100 * data$wgt / data$hgt
meth <- c("", "norm", "norm", "", "", "norm")
imps <- smcfcs(originaldata = data, meth = meth, smtype = "lm",
               smformula = "hc ~ age + hgt + wgt + whr")
fit <- lapply(imps$impDatasets, lm,
              formula = hc ~ age + hgt + wgt + whr)
summary(pool(fit))

If you do want to use mice, what is the specific conditioning that you need? The conditional imputation example in FIMD (Flexible Imputation of Missing Data) squeezes the imputed values into a fixed range as follows:

library(mice)
data <- airquality[, 1:2]
post <- make.post(data)
post["Ozone"] <-
  "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(1, 200))"
imp <- mice(data, method = "norm.nob", m = 1,
            maxit = 1, seed = 1, post = post)

Otherwise, take a look at the mice postprocessing vignette or this answer.
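
For the constraint in the question, X1 >= X2, the same post mechanism could be adapted roughly as below. This is only a sketch under assumptions: it relies on the post strings being evaluated inside the mice sampler, where the objects `data` (the currently completed data) and `r` (the response indicator) are visible next to `imp`, `i` and `j`. The data generation mirrors the question without tidyverse, and the seeds are arbitrary. Note that pmin()/pmax() simply truncate the imputations, so violating draws pile up on the line X1 = X2, which may distort the imputed distribution.

```r
library(MASS)  # mvrnorm()
library(mice)

set.seed(1)  # arbitrary seed
n <- 100
xy <- mvrnorm(n, mu = c(5, 5), Sigma = matrix(c(10, 5, 5, 9), ncol = 2))
sample_data <- data.frame(X1 = xy[, 1], X2 = xy[, 2])

# Make values missing, then keep rows that satisfy X1 >= X2 or have a missing value
sample_data$X1[rbinom(n, 1, 0.7) == 0] <- NA
sample_data$X2[rbinom(n, 1, 0.65) == 0] <- NA
keep <- is.na(sample_data$X1) | is.na(sample_data$X2) |
  sample_data$X1 >= sample_data$X2
sample_data <- sample_data[keep, ]

# Assumption: `data` and `r` are available when the post string is evaluated.
# data[!r[, j], "X2"] holds the current X2 values for the rows where X1 is imputed.
post <- make.post(sample_data)
post["X1"] <- "imp[[j]][, i] <- pmax(imp[[j]][, i], data[!r[, j], 'X2'])"
post["X2"] <- "imp[[j]][, i] <- pmin(imp[[j]][, i], data[!r[, j], 'X1'])"

imp <- mice(sample_data, m = 1, post = post, seed = 1, printFlag = FALSE)
imputed_data <- complete(imp, 1)

# Count rows that still violate the constraint
sum(imputed_data$X1 < imputed_data$X2)
```

Because each variable is clamped against the current value of the other one right after it is imputed, the last update in every iteration restores the constraint, including for rows where both values are missing. If truncation is too crude for the analysis, redrawing violating values or a substantive-model-compatible approach such as smcfcs may be preferable.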