How to Test Missing At Random (MAR) in R


How do I test the Missing At Random (MAR) assumption in R?

Below are example data with code for testing missing completely at random (MCAR), as well as imputation of missing data (which, however, assumes MAR).

# Example Questionnaire data
library(tibble)

set.seed(1)
Q1 <- sample(c(1:10, NA), 100, replace = TRUE)
Q2 <- sample(c(1:10, NA), 100, replace = TRUE)
Q3 <- sample(c(1:10, NA), 100, replace = TRUE)
Q4 <- sample(c(1:10, NA), 100, replace = TRUE)

questionnaire <- tibble(Q1, Q2, Q3, Q4)

# Test Missing Completely at Random (MCAR)
naniar::mcar_test(questionnaire)

# Plot number of missing values across variables
naniar::gg_miss_var(questionnaire)

### How do I test for Missing at Random (MAR)?




# (Impute missing values; note that imputation with mice assumes MAR)
questionnaire_imp <- mice::complete(mice::mice(questionnaire), action = "long")

There are 2 answers

Anay (best answer)

Try logistic regression: to test for MAR, we can create a binary variable that indicates whether the data are missing (1) or not missing (0), and regress it on the observed variables.

Modify the sample code below as per your requirements:

# Example Questionnaire data
set.seed(1)
Q1 <- sample(c(1:10, NA), 100, replace = TRUE)
Q2 <- sample(c(1:10, NA), 100, replace = TRUE)
Q3 <- sample(c(1:10, NA), 100, replace = TRUE)
Q4 <- sample(c(1:10, NA), 100, replace = TRUE)

questionnaire <- data.frame(Q1, Q2, Q3, Q4)

# Create a binary variable indicating missing data
questionnaire$missing <- ifelse(rowSums(is.na(questionnaire)) > 0, 1, 0)

# Perform logistic regression
model <- glm(missing ~ Q1 + Q2 + Q3 + Q4, data = questionnaire, family = "binomial")

# summary(model)

# It gives us z-values. To calculate p-values from the z-values:

# Calculate the p-values from the z-values for each predictor variable
coefficients <- summary(model)$coefficients[, "Estimate"]
standard_errors <- summary(model)$coefficients[, "Std. Error"]
z_values <- coefficients / standard_errors
p_values <- 2 * (1 - pnorm(abs(z_values)))

# Compare the p-values with the significance level (e.g., 0.05)
significance_level <- 0.05
is_mar <- all(p_values > significance_level)

# Print the p-values and conclusion
print("P-Values:")
print(p_values)

print("Conclusion:")
if (is_mar) {
  print("The data is Missing at Random (MAR).")
} else {
  print("The data is not Missing at Random (not MAR).")
}

OUTPUT

[1] "P-Values:"
(Intercept)          Q1          Q2          Q3          Q4 
  0.9998727   1.0000000   1.0000000   1.0000000   1.0000000 
[1] "Conclusion:"
[1] "The data is Missing at Random (MAR)."

Hope this helps!

Alex

If you're just looking for a quick eyeball test, you could simply draw a heat map of each variable's presence/absence against each other variable. For example:

library(tidyverse)

# Example Questionnaire data
# All uncorrelated, except for Q4 NAs against Q1 values
set.seed(1)
Q1 <- sample(c(1:10, NA), 100, replace = TRUE)
Q2 <- sample(c(1:10, NA), 100, replace = TRUE)
Q3 <- sample(c(1:10, NA), 100, replace = TRUE)
Q4 <- sample(1:10, 100, replace = TRUE) %>% {ifelse(Q1>5,.,NA)}

questionnaire <- tibble(Q1, Q2, Q3, Q4)

questionnaire %>% 
  # for each existing column, add a logical column indicating presence of NA
  {mutate(., across(colnames(.) %>% {setNames(., paste0(.,"_na"))}, ~ as.integer(is.na(.x))))} %>%
  # generate plot of pairwise correlations
  DataExplorer::plot_correlation(type="continuous", cor_args=list(use="pairwise.complete.obs"))

Here I've replaced your Q4 with a variable that will be NA wherever Q1 is 5 or less (or itself NA).

[Heat map of pairwise correlations]

Just looking at this heat map shows immediately that the data is very much not MAR: NAs in field Q4 are very highly correlated (-85%) with the values in field Q1. Obviously that's no substitute for a full statistical test, but depending on your use case (data exploration vs formal analysis) it might be good enough.
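
(If you only want that single number rather than the whole plot, the same pairwise correlation can be computed directly - a minimal sketch, reusing the questionnaire tibble defined above:)

# Correlation between Q4's missingness indicator and the observed Q1 values
cor(as.integer(is.na(questionnaire$Q4)), questionnaire$Q1,
    use = "pairwise.complete.obs")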

A more formal approach, taken from the DataCamp course that LC-scientist mentions, would be to fit a logistic regression model for each field's NA-indicator variable against each of the other fields in turn, then look at the p-value. This has the obvious issue that it will only capture linear correlations - if variable A's NAs are conditional on variable B squared then you're out of luck - and the slightly less obvious issue that the p-value threshold needs to be adjusted to avoid replicating an XKCD cartoon.

(I'm also slightly wary of considering this a true formal test. My gut feel is that you really need a proper joint test rather than merely repeating pairwise tests, although I'm not aware of one. For this post, I'll assume that the methodology is basically valid.)
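
To make that concrete before the full looped version, a single pairwise test might look like this (a minimal sketch; it reuses the questionnaire tibble and the tidyverse packages loaded above):

# Does Q1 predict missingness in Q2? One logistic regression, one p-value.
# (Rows where Q1 is itself NA are dropped by glm()'s default na.action.)
fit <- questionnaire %>%
  mutate(Q2_na = as.integer(is.na(Q2))) %>%
  glm(Q2_na ~ Q1, family = "binomial", data = .)

coef(summary(fit))["Q1", "Pr(>|z|)"]  # unadjusted p-value for the Q1 coefficient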

The p-value issue can be fixed using a Šidák correction: replacing the threshold α with 1-(1-α)^(1/#tests). Or conversely - and what I'll do here - replacing the calculated p-value with 1-(1-p)^(#tests), so the viewer can apply whatever threshold they like. For N variables there will be N(N-1) tests.
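
For the four variables in this example that works out as follows (a small illustrative snippet; 5% is just an example threshold):

alpha   <- 0.05
n_tests <- 4 * (4 - 1)                       # N(N-1) = 12 pairwise tests
alpha_adj <- 1 - (1 - alpha)^(1 / n_tests)   # adjusted threshold, roughly 0.0043

# or equivalently, adjust each raw p-value and keep the original threshold:
# p_adj <- 1 - (1 - p)^n_tests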

Code for this - appended to the bottom of the above - is as follows:

# Get the p-value of logistic-regressing Vdep against Vreg
get_p <- function(data,Vreg,Vdep)
  # Usage: get_p(df,"Q1","Q2_na") to test the dependency of Q2_na on the regressor Q1
  glm(formula=as.formula(paste0(Vdep," ~ ",Vreg)), 
      family="binomial", data=data, na.action="na.omit") %>% 
    { coef(summary(.))[2,4] } %>% # warning: summary.glm returns 95% for a 5% p-value
    return()

# Apply Sidak adjustment to p-value
# Expressed as 5% = significant, per the Wikipedia page
adj_sidak <- function(pval, Nvars) return( 1-(1-pval) ^ (Nvars*(Nvars-1)) )

questionnaire %>% 
  # for each existing column, add a logical column indicating presence of NA
  {mutate(., across(colnames(.) %>% {setNames(., paste0(.,"_na"))}, ~ as.integer(is.na(.x))))} %>%
  { suppressWarnings( {
    ddd <- .

    ddd %>%
      # extract original column names
      colnames() %>% str_subset(pattern="^((?!_na).)+$") %>% 
      # create dataframe of non-equal column name pairs
      {expand.grid(.,.)} %>% filter(Var1!=Var2) %>%
      # for each pair, calculate the p-value from regressing 2nd onto 1st
      # (before and after Sidak correction)
      mutate(pval = mapply(get_p, 
                           .$Var1, Vdep=paste0(.$Var2,"_na"), 
                           MoreArgs=list(data=ddd)),
             pval_adj = 1-adj_sidak(1-pval, ncol(ddd)/2*(ncol(ddd)/2-1)))
  } ) } %>%
  # convert results to matrix-style table
  select(-pval) %>% pivot_wider(names_from=Var2, values_from=pval_adj) %>%
  # clean up formatting
  arrange(Var1) %>% column_to_rownames("Var1") %>% mutate_all(~ sprintf("%.2f%%",.x*100))

And the result looks something like this:

        Q1    Q2    Q3     Q4
Q1     NA% 0.00% 0.00% 78.75%
Q2   0.00%   NA% 0.00%  2.54%
Q3   0.00% 0.00%   NA%  0.00%
Q4 100.00% 0.00% 0.00%    NA%

Note that only the Q1-Q4 connection has a p-value less than 5% (expressed by R as 1-p=100.00%), and only in the direction from Q1 to Q4.

Going the other way round, there is visibly (78.75%) some kind of connection - the "backwash" from conditioning Q4's NAs entirely on Q1 - and we would have seen a spuriously significant p-value (99.82%!) if we hadn't applied the Šidák correction. The small Q4-Q2 link (2.54%) is just random chance, but again would have looked a lot more significant (97.26%) without the correction.