I am trying to use EM (expectation-maximization) to fill in missing data in R, but I am not sure how to model/code it for my specific case. I am generally following the example format from this post (https://stats.stackexchange.com/questions/564630/how-does-expectation-maximization-compute-missing-data), which uses the following structure for the EM process:
library("MASS")
set.seed(123)
n <- 1000
theta <- 20
beta0 <- 3
beta1 <- 0.2
x <- runif(n=n, min=1, max=10)
mu <- exp(beta0 + beta1*x)
y <- rnegbin(n=n, mu=mu, theta=theta)
idxm <- which(y %in% sample(y, 100))
y0 <- y[-idxm]
x0 <- x[-idxm]
logy0 <- log(y0)
# initialize parameter values
betaInit <- unname(lm(logy0 ~ x0)$coefficients)
mu0 <- exp(betaInit[1] + betaInit[2]*x0)
thetaInit <- mean(mu0^2/(mean((mu0 - y0)^2) - mu0))
params1 <- c(thetaInit, betaInit)
params0 <- numeric(3)
# initialize unobserved values
y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))
# negative log-likelihood
fNLL <- function(params) {
  -sum(dnbinom(y, size = params[1], mu = exp(params[2] + params[3]*x), log = TRUE))
}
# iteratively find MLE on imputed data and then re-impute data
while (max(abs(params1 - params0)) > 1e-9) {
  params1 <- optim(params0 <- params1, fNLL, lower = 0, method = "L-BFGS-B")$par
  y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))
}
However, my data are a bit different. I am trying to fill in missingness in two variables (x1 and x2) that will be predictors in my regression. I do not know the complete-data parameters for x1 and x2, but I do have complete data for x3, x4, and x5, which are additional predictors in the regression model I will ultimately run on the completed data. That regression is very simple and looks like the following:
model <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = data)
How can I iteratively update both x1 and x2 using EM to fill in their missingness, similar to the example above? Do I need to use information from x3, x4, and x5 (as in multiple imputation?), or can I use a simple EM algorithm like the one above and update x1 and x2 simultaneously? Unlike this example, I do not have a complete version of the data to work from.
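To make this concrete, the overall structure I have in mind is something like the sketch below. Everything in it is hypothetical (the data frame df, the names miss1/miss2, the tolerance, and the idea of regressing each incomplete variable on the others are my own guesses), and I am not sure whether it is actually EM:
# Hypothetical data: y depends on x1..x5, and x1 and x2 have missing values
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                 x4 = rnorm(n), x5 = rnorm(n))
df$y <- with(df, 1 + x1 + x2 + x3 + x4 + x5 + rnorm(n))
df$x1[sample(n, 40)] <- NA
df$x2[sample(n, 40)] <- NA

miss1 <- which(is.na(df$x1))
miss2 <- which(is.na(df$x2))

# crude starting fill: means of the observed values
df$x1[miss1] <- mean(df$x1, na.rm = TRUE)
df$x2[miss2] <- mean(df$x2, na.rm = TRUE)

for (iter in 1:100) {
  old <- c(df$x1[miss1], df$x2[miss2])
  # regress each incomplete variable on the others -- is this the right M-step?
  fit1 <- lm(x1 ~ x2 + x3 + x4 + x5, data = df)
  fit2 <- lm(x2 ~ x1 + x3 + x4 + x5, data = df)
  # replace the missing entries with their predicted values -- the E-step?
  df$x1[miss1] <- predict(fit1, newdata = df[miss1, ])
  df$x2[miss2] <- predict(fit2, newdata = df[miss2, ])
  # stop once the imputed values barely change (arbitrary tolerance)
  if (max(abs(c(df$x1[miss1], df$x2[miss2]) - old)) < 1e-8) break
}

# final regression on the completed data
final_model <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = df)
Is this the kind of iteration people mean, or should the updates for x1 and x2 be driven only by their own distributions, as in the single-variable example?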
UPDATE
I have worked on the code and am a bit closer, but still am not sure how to program this. My code currently looks like the following:
library(tidyverse)
# Simulate data with missing values for two variables
n <- 100
prop_missing <- 0.2
observed_data <- data.frame(
x1 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing)),
x2 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing))
)
# Initial parameters
initial_mean_x1 <- mean(observed_data$x1, na.rm = TRUE)
initial_sd_x1 <- sd(observed_data$x1, na.rm = TRUE)
initial_mean_x2 <- mean(observed_data$x2, na.rm = TRUE)
initial_sd_x2 <- sd(observed_data$x2, na.rm = TRUE)
# Missing value row IDs
missing_id_x1 <- observed_data %>% rowid_to_column() %>% filter(is.na(x1)) %>% pull(rowid)
missing_id_x2 <- observed_data %>% rowid_to_column() %>% filter(is.na(x2)) %>% pull(rowid)
# E-step: Impute missing values based on the current parameters
imputed_data <- observed_data
imputed_data$x1[missing_id_x1] <- rnorm(length(missing_id_x1),mean = initial_mean_x1, sd = initial_sd_x1)
imputed_data$x2[missing_id_x2] <- rnorm(length(missing_id_x2),mean = initial_mean_x2, sd = initial_sd_x2)
# M-step: Update parameters by maximizing the log-likelihood using optim
log_likelihood <- function(params, data) {
  mean_val <- params[1]
  sd_val <- params[2]
  ll <- sum(dnorm(data[!is.na(data)], mean_val, sd_val, log = TRUE))
  return(-ll)  # minimize the negative log-likelihood
}
# Update for x1
result_x1 <- optim(c(initial_mean_x1, initial_sd_x1), log_likelihood, data = observed_data$x1)
# Extract the updated parameters for x1
updated_mean_x1 <- result_x1$par[1]
updated_sd_x1 <- result_x1$par[2]
# Update for x2
result_x2 <- optim(c(initial_mean_x2, initial_sd_x2), log_likelihood, data = observed_data$x2)
# Extract the updated parameters for x2
updated_mean_x2 <- result_x2$par[1]
updated_sd_x2 <- result_x2$par[2]
# Output the updated parameters
paste0("Updated Parameters for x1")
paste0("Mean:", updated_mean_x1)
paste0("SD:", updated_sd_x1)
paste0("Updated Parameters for x2")
paste0("Mean:", updated_mean_x2)
paste0("SD:", updated_sd_x2)
# Output the imputed data for one iteration
paste0("Imputed Data:")
paste0(imputed_data)
As I understand it, I do not need to use the other variables in the EM algorithm to impute the missing data, so I only include x1 and x2 in this sample data.
I extract the parameters for x1 and x2 and sample values to fill in the missingness for each; this I understand to be the E-step.
The M-step is where I am stuck. I know that I am supposed to maximize the log-likelihood to update the parameters and move toward more accurate draws, until I reach a tolerance threshold. But a tolerance on what, exactly? Is it the change in the log-likelihood, or in the parameters?
And how do I iterate this? Do I re-impute the missing values with the newly optimized parameters and keep going until the tolerance is reached?
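For example, is the loop below roughly the right shape? It is only a sketch reusing the objects defined above; the tolerance value is an arbitrary guess, and I have plugged in the current mean for the missing values rather than a fresh random draw, because otherwise I do not see how the change in the likelihood could ever fall below a tolerance:
# Sketch of the iteration I have in mind for x1, reusing log_likelihood,
# missing_id_x1, initial_mean_x1, initial_sd_x1 and imputed_data from above
tol <- 1e-6          # arbitrary tolerance -- on the likelihood? the parameters?
mean_x1 <- initial_mean_x1
sd_x1 <- initial_sd_x1
old_nll <- Inf

for (iter in 1:100) {
  # E-step(?): fill the missing x1 values using the current parameters
  imputed_data$x1[missing_id_x1] <- mean_x1
  # M-step(?): re-estimate the mean and sd from the completed variable
  fit <- optim(c(mean_x1, sd_x1), log_likelihood, data = imputed_data$x1)
  mean_x1 <- fit$par[1]
  sd_x1 <- fit$par[2]
  # stop when the negative log-likelihood barely changes
  if (abs(old_nll - fit$value) < tol) break
  old_nll <- fit$value
}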
Update 2
I have found a code structure for updating a single variable that seems to achieve what I want. Below is what I need to do to update the values:
em.norm <- function(Y, mit, sit) {
  Yobs <- Y[!is.na(Y)]         # extract the observed values
  Ymis <- Y[is.na(Y)]          # extract the missing values
  n <- length(c(Yobs, Ymis))   # total number of observations
  r <- length(Yobs)            # number of observed values
  # Initial value for the mean
  mut <- mit
  # Define the log-likelihood function
  ll <- function(y, mu, sigma2, n) {
    -0.5*n*log(2*pi*sigma2) - 0.5*sum((y - mu)^2)/sigma2
  }
  # Compute the log-likelihood for the initial values,
  # ignoring the missing-data mechanism
  lltm1 <- ll(Yobs, mut, sit, n)
  repeat {
    # E-step: expected sufficient statistics given the current parameters
    EY <- sum(Yobs) + (n - r)*mut
    EY2 <- sum(Yobs^2) + (n - r)*(mut^2 + sit)
    # M-step: update the parameter estimates
    mut1 <- EY/n
    sit1 <- EY2/n - mut1^2
    # Update parameter values
    mut <- mut1
    sit <- sit1
    # Compute the log-likelihood using the current estimates,
    # ignoring the missing data
    llt <- ll(Yobs, mut, sit, n)
    # Print current parameter values
    cat(mut, sit, llt, "\n")
    # Stop if converged
    if (abs(lltm1 - llt) < 0.001) break
    lltm1 <- llt
  }
  # Return the final parameter estimates (mean and variance)
  return(c(mut, sit))
}
# function em.norm ends here
# Example: 20 draws from N(5, 1), with the last 5 set to missing
x <- rnorm(20, 5)
x[16:20] <- NA
xobs <- x[!is.na(x)]
xmis <- x[is.na(x)]
r <- length(xobs)
mit <- mean(xobs)              # starting value for the mean
sit <- var(xobs)*(r - 1)/r     # starting value for the variance (MLE version)
em.norm(x, mit, sit)
However, this does not return the resulting completed dataset. How can I access the actual filled-in data, and how can I modify this to update two variables at once, or would I need to run the function separately for each variable and rejoin the datasets afterwards?
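For instance, would a wrapper like the one below be the way to get the completed data out, and to handle two variables by simply running it once per column? This is only my guess: the function name impute_with_em is mine, and filling the missing entries with the estimated mean may not be the right thing to do.
# Hypothetical wrapper: run em.norm on one variable, then plug the estimated
# mean into the missing positions and return the completed vector
impute_with_em <- function(Y) {
  Yobs <- Y[!is.na(Y)]
  r <- length(Yobs)
  est <- em.norm(Y, mean(Yobs), var(Yobs)*(r - 1)/r)  # returns c(mean, variance)
  Y[is.na(Y)] <- est[1]   # fill the missing entries with the estimated mean
  Y
}

# Apply it separately to each incomplete variable and rebuild the data frame
completed <- observed_data
completed$x1 <- impute_with_em(observed_data$x1)
completed$x2 <- impute_with_em(observed_data$x2)
head(completed)
Is that the intended usage, or is there a standard way to have the EM routine itself return the completed data for several variables at once?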