Discrepancy between lme4 sample size and complete case dataset?

Question

Discrepancy between lme4 sample size and complete case dataset?

41 views Asked by Justin Research At 19 December 2023 at 20:23

I'm currently estimating a hierarchical linear model (HLM) using lme4. My entire dataset has 367 observations. lme4 estimated my model using 341 observations - I assumed some were dropped due to missing data. However, when I sum the complete cases on the model's variables, I end up with 337 observations. This is making it difficult to test for assumptions when the model is a different length than the dataset.

There is a discrepancy between complete cases and the observations 'used' by lme4.

Why would lme4 use 4 non-complete cases?
How would I find out what exact observations (as ID #s) are being used by lme4?

As described, I tried to remove missing data from my main dataset assuming lme4 drops cases listwise. I've tried checking each variable for its missingness to see if lme4 was just testing a certain variable, but none match up with lme4's output estimate of 341.

If needed, I can provide the anonymized dataset - but hoping there's something easy I'm not aware of!

Original Q&A

There are 1 answers

**Ben Bolker** · Accepted Answer · 2023-12-19T22:49:16+00:00

The most obvious reason would be that lme4 (and internally model.frame) assesses completeness only on the basis of the variables that are actually used in the model. Do you have NA values in variables that are not included in the model formula?

(For what it's worth, this default also means that it's a good idea to filter the full data set for complete cases first if you are going to fit a series of models with different subsets of predictors and want the models to be comparable ...)

Example:

library(lme4)
ss <- sleepstudy
ss$Days[c(1, 4, 10)] <- NA
ss$extra <- 1
ss$extra[c(4, 20, 25)] <- NA

nrow(ss)  ## 180
sum(complete.cases(ss))  ## 175; drops obs with missing `extra`
m <- lmer(Reaction ~ Days + (1|Subject), ss, na.action = na.omit)
mf <- model.frame(m)
nrow(mf)  ## 177; ignores variables not in the model

Which rows were deleted? Two ways to check:

attr(mf, "na.action")
 1  4 10 
 1  4 10 
attr(,"class")
[1] "omit"

setdiff(rownames(ss), rownames(mf))
[1] "1"  "4"  "10"

The equivalent to the internal code is

nrow(model.frame(Reaction ~ Days + Subject, ss))  ## 177

(the formula gets processed so that the random effects grouping variables are also included in the formula for the purpose of excluding incomplete cases)

TechQA.

Discrepancy between lme4 sample size and complete case dataset?

There are 1 answers

Related Questions in R

Related Questions in MISSING-DATA

Related Questions in LME4

Related Questions in MULTI-LEVEL

Popular Questions

Trending Questions