I'm currently estimating a hierarchical linear model (HLM) using lme4. My entire dataset has 367 observations. lme4 estimated my model using 341 observations - I assumed some were dropped due to missing data. However, when I sum the complete cases on the model's variables, I end up with 337 observations. This is making it difficult to test for assumptions when the model is a different length than the dataset.
There is a discrepancy between complete cases and the observations 'used' by lme4.
- Why would lme4 use 4 non-complete cases?
- How would I find out what exact observations (as ID #s) are being used by lme4?
As described, I tried to remove missing data from my main dataset assuming lme4 drops cases listwise. I've tried checking each variable for its missingness to see if lme4 was just testing a certain variable, but none match up with lme4's output estimate of 341.
If needed, I can provide the anonymized dataset - but hoping there's something easy I'm not aware of!
The most obvious reason would be that lme4 (and, internally, model.frame) assesses completeness only on the basis of the variables that are actually used in the model. Do you have NA values in variables that are not included in the model formula?
(For what it's worth, this default also means that it's a good idea to filter the full data set for complete cases first if you are going to fit a series of models with different subsets of predictors and want the models to be comparable ...)
Example:
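A minimal sketch of the behaviour described above, assuming the sleepstudy data that ships with lme4. The injected NA values and the extra column are made up purely for illustration; they are not from the asker's data.

```r
library(lme4)

# Illustrative data: sleepstudy with NAs injected by hand
dd <- sleepstudy
dd$Reaction[c(3, 50, 99, 120, 171)] <- NA   # NAs in a variable used by the model
dd$extra <- seq_len(nrow(dd))               # hypothetical variable NOT in the model
dd$extra[seq(1, 180, by = 20)] <- NA        # NAs in the unused variable

m <- lmer(Reaction ~ Days + (1 | Subject), data = dd)

n_used      <- nobs(m)                      # rows actually used in the fit
n_all_cols  <- sum(complete.cases(dd))      # complete on ALL columns
n_model_var <- sum(complete.cases(dd[, c("Reaction", "Days", "Subject")]))

# n_used agrees with n_model_var, not with n_all_cols
c(used = n_used, all_columns = n_all_cols, model_vars = n_model_var)
```

If your own complete-case count is lower than the number of observations the model reports, it is worth checking whether the count included a column that is not in the formula.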
Which rows were deleted? Two ways to check:
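One possible sketch of two such checks, again using sleepstudy with hand-injected NA values as stand-in data. The first reads the na.action attribute stored on the model frame; the second compares row names of the original data with those of the model frame. The object names (dropped_na, dropped_rows, used_rows) are just illustrative.

```r
library(lme4)

# Illustrative data: three responses set to NA by hand
dd <- sleepstudy
dd$Reaction[c(2, 17, 40)] <- NA

m <- lmer(Reaction ~ Days + (1 | Subject), data = dd)

# Way 1: the na.action attribute of the model frame records omitted rows
dropped_na <- attr(model.frame(m), "na.action")

# Way 2: rows of the original data whose names are absent from the model frame
dropped_rows <- setdiff(rownames(dd), rownames(model.frame(m)))

# Look up the grouping ID (here Subject) for the dropped rows
dd[dropped_rows, c("Subject", "Days")]

# Row names of the observations that were actually used
used_rows <- rownames(model.frame(m))
```

With your own data, replacing Subject with your ID column in the lookup should give the IDs of the dropped observations, and used_rows can be used to subset the data to the rows the model saw.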
The equivalent to the internal code is a call to model.frame on a processed formula: the formula gets processed so that the random effects grouping variables are also included, for the purpose of excluding incomplete cases.
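As a rough by-hand approximation of that step (not lme4's actual internal code), you can write out a formula in which the grouping variable appears as an ordinary term and pass it to model.frame. The data and the hand-expanded formula below are assumptions for illustration.

```r
library(lme4)

# Illustrative data: NAs in the response and in the grouping variable
dd <- sleepstudy
dd$Reaction[c(2, 17, 40)] <- NA
dd$Subject[c(5, 60)] <- NA

m <- lmer(Reaction ~ Days + (Days | Subject), data = dd)

# Hand-expanded formula: the grouping variable Subject is listed as a plain
# term, so rows with NA in Subject are dropped as well
mf <- model.frame(Reaction ~ Days + Subject, data = dd)

# Both counts should agree
c(model_frame = nrow(mf), lmer = nobs(m))
```

This also shows why a complete-case count that leaves out the grouping variable can disagree with the number of observations the model reports.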