Manually program EM in r to update multiple parameters and solve missing data

95 views Asked by At

I am trying to use EM (Expectation-maximization) to fill in missing data in R, but am not sure how to model/code it for my specific case. I am generally trying to follow the example format used in this post (https://stats.stackexchange.com/questions/564630/how-does-expectation-maximization-compute-missing-data), with the following structure of the EM process:

library("MASS")
set.seed(123)
n     <- 1000
theta <- 20
beta0 <- 3
beta1 <- 0.2
x     <- runif(n=n, min=1, max=10)
mu    <- exp(beta0 + beta1*x)
y     <- rnegbin(n=n, mu=mu, theta=theta)
idxm  <- which(y %in% sample(y, 100))
y0    <- y[-idxm]
x0    <- x[-idxm]
logy0 <- log(y0)

# initialize parameter values
betaInit  <- unname(lm(logy0 ~ x0)$coefficients)
mu0       <- exp(betaInit[1] + betaInit[2]*x0)
thetaInit <- mean(mu0^2/(mean((mu0 - y0)^2) - mu0))
params1   <- c(thetaInit, betaInit)
params0   <- numeric(3)

# initialize unobserved values
y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))

# negative log-likelihood
fNLL <- function(params) {
  -sum(dnbinom(y, params[1], mu = exp(params[2] + params[3]*x), log = TRUE))
}

# iteratively find MLE on imputed data and then re-impute data
while (max(abs(params1 - params0)) > 1e-9) {
  params1 <- optim(params0 <- params1, fNLL, lower = 0, method = "L-BFGS-B")$par
  y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))
}

However, in my case, my data is a bit different. I am trying to fill in missingness in 2 variables (x1 and x2) that will be predictors (betas) in my regression. I do not know the parameters of the complete data for x1 and x2, but I do have complete data for x3, x4, and x5, all additional predictive variables in the regression model I will ultimately run with the complete data. My ultimate regression is very simple and looks like the following:

model = lm(data, y ~ x1 + x2 + x3 + x4 + x5)

How can I iteratively update both x1 and x2 using EM to fill in their missingness, similar to the above example? Do I need to use information from x3 x4 and x5 (like multiple imputation?) or can I use a simple EM algorithm with the above and update x1 and x2 simultaneously? In my case, I do not have complete data (as this example does)

UPDATE

I have worked on the code and am a bit closer, but still am not sure how to program this. My code currently looks like the following:


library(tidyverse)

# Simulate data with missing values for two variables
n <- 100
prop_missing <- 0.2
observed_data <- data.frame(
  x1 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing)),
  x2 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing))
)



# Initial parameters
initial_mean_x1 <- mean(observed_data$x1,na.rm = T)
initial_sd_x1 <- sd(observed_data$x1,na.rm = T)
initial_mean_x2 <- mean(observed_data$x2,na.rm = T)
initial_sd_x2 <- sd(observed_data$x2,na.rm = T)

#Missing value row IDs

missing_id_x1 = observed_data %>% dplyr::select(x1) %>% rowid_to_column() %>% filter(is.na(x1)) %>% dplyr::select(rowid) %>% pull()
missing_id_x2 = observed_data %>% dplyr::select(x2) %>% rowid_to_column() %>% filter(is.na(x2)) %>% dplyr::select(rowid) %>% pull()

# E-step: Impute missing values based on the current parameters
imputed_data <- observed_data
imputed_data$x1[missing_id_x1] <- rnorm(length(missing_id_x1),mean = initial_mean_x1, sd = initial_sd_x1)
imputed_data$x2[missing_id_x2] <- rnorm(length(missing_id_x2),mean = initial_mean_x2, sd = initial_sd_x2)

# M-step: Update parameters by maximizing the log likelihood using optim
log_likelihood <- function(params, data, variable) {
  mean_val <- params[1]
  sd_val <- params[2]
  log_likelihood <- sum(log(dnorm(data[!is.na(data)], mean_val, sd_val)))
  return(-log_likelihood)  # Minimize the negative log likelihood
}

# Update for x1
result_x1 <- optim(c(initial_mean_x1, initial_sd_x1), log_likelihood, data = observed_data$x1, variable = "x1")

# Extract the updated parameters for x1
updated_mean_x1 <- result_x1$par[1]
updated_sd_x1 <- result_x1$par[2]

# Update for x2
result_x2 <- optim(c(initial_mean_x2, initial_sd_x2), log_likelihood, data = observed_data$x2, variable = "x2")

# Extract the updated parameters for x2
updated_mean_x2 <- result_x2$par[1]
updated_sd_x2 <- result_x2$par[2]

# Output the updated parameters
paste0("Updated Parameters for x1")
paste0("Mean:", updated_mean_x1)
paste0("SD:", updated_sd_x1)

paste0("Updated Parameters for x2")
paste0("Mean:", updated_mean_x2)
paste0("SD:", updated_sd_x2)

# Output the imputed data for one iteration
paste0("Imputed Data:")
paste0(imputed_data)

As I understand, I do not need to use the other variables in the EM algorithm to impute the missing data, so I only include variables x1 and x2 in this sample data.

I extract the parameters for x1 and x2, and sample data to fill in the missingness for each. This I understand to be the E-step.

The M-step is where I am stuck. I know that I am supposed to maximize the logged likelihood to update the parameters and move toward more accurate sample draws, until I reach a tolerance threshold. But this is the tolerance for what, exactly? Is it the change in the logged likelihood, or the parameters?

If so, how do I iterate this? Do I update the same values again with the new optimized parameters and keep going until the tolerance is reached?

Update 2

I have found a code structure for updating a single variable that seems to acheive what I want to. The below is what I need to do to update values:

em.norm <- function(Y,mit,sit)
{Yobs <- y[!is.na(Y)] #Extract the observed values
Ymis <- Y[is.na(Y)] #Extract the missing values
n <- length(c(Yobs,Ymis)) #Length of total observations
r <- length(Yobs) #length of available observations

#Initial values
mut <- mit
sit <- sit

#Defin log-likelihood function
ll = function(y,mu,sigma2,n)
{
-0.5*n*log(2*pi*sigma2) - 0.5*sum((y-mu)^2)/sigma2
}

#comput log-likelihood for initial values
#and ignoring missing data mechanism

lltm1 <- ll(Yobs,mut,sit,n)
repeat{
  
#E-step
EY <- sum(Yobs) + (n-r)*mut
EY2 <- sum(Yobs^2) + (n-r)*(mut^2 + sit)
#M-step
mut1 <- EY/n
sit1 <- EY2/n - mut1^2

#update parameter values

mut <- mut1
sit <- sit1

#Compute log-likelihood using current estimates
#and ignoring missing data 

llt <- ll(Yobs,mut,sit,n)

#Print current parameter values
cat(mut,sit,llt,"\n")

#Stop if converged 
if (abs(lltm1 - llt) < 0.001) break
lltm1 <- llt
}

#fill in missing values with new mu #

return(c(mut,sit))
}
#function em norm ends here

x <- rnorm(20,5)
x[16:20] <- NA

xobs <- x[!is.na(x)]
xmis <- x[is.na(x)]
r <- length(xobs)
mit <- mean(xobs)
sit <- var(xobs)*(r-1)/r
em.norm(x,mit,sit)

However this does not seem to show the resultant completed dataset. How can I access the actual filled data, and how can I modify this to update to variables at once, or would I need to run this function separately for another variable and rejoin the datasets together?

0

There are 0 answers