I have an environmental data set consisting of continuous, non-normally distributed observations. My goal is to construct a latent variable from the 5 measured variables. The theory behind this construct seems sound, but I’m stuck on getting the idea formalized.
The 5 variables are strongly correlated (bivariate correlations of .75-.95), and as I understand it, this is a problem for structural equation modeling? I’ve tried SEM with the ‘lavaan’ package in R, but I’m getting nowhere. So should I stick with SEM and keep iterating on the model, or should I use some other approach?
Really more of a statistics question than an R question, but nevertheless...
Consider principal components analysis (PCA), which transforms a set of correlated variables into a new set of uncorrelated (orthogonal) variables, the principal components (PCs). When the original variables are strongly correlated, a small number of PCs usually explains nearly all the variability in the dataset.
Running PCA on the built-in iris dataset in R illustrates this: PC1, the first principal component, explains 73% of the variation in the dataset, and the first two (PC1 and PC2) together explain 96%.
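To reproduce those percentages, something like the following should work. It is a minimal sketch rather than a definitive recipe: using columns 1 to 4 of iris and setting scale. = TRUE are my assumptions (standardizing is what gives the 73% / 96% split), and the names df, pca and var_explained are just illustrative.

```r
# Keep only the four numeric measurement columns (drop Species).
df <- iris[, 1:4]

# scale. = TRUE standardizes each variable before the decomposition.
# Assumption: this is the setting behind the percentages quoted above.
pca <- prcomp(df, scale. = TRUE)

# Proportion of total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Running total: how much the first k components explain together.
round(cumsum(var_explained), 2)

# The built-in summary reports the same quantities in one table.
summary(pca)
```

With your own data you would replace df with a data frame holding your 5 variables; whether to standardize depends on whether they share a common scale.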
Edit: Responding to @erska's comment/question below: looking at the correlations between the original variables and the principal components shows that PC1 is highly correlated with Sepal.Length, Petal.Length, and Petal.Width, and moderately negatively correlated with Sepal.Width. PC4 is not highly correlated with anything, which is not surprising since it is composed of mostly random variation. This is a typical pattern in PCA.
I think there might be a misunderstanding of the way PCA works. If you have, say, n variables in your original dataset, PCA by definition will identify n principal components, ordered by the fraction of variability explained (so PC1 explains the most variability, etc.). You can tell the algorithm how many to report (e.g., just PC1, or PC1 and PC2), but the calculation always produces n PCs.
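Both points can be checked directly on iris. The snippet below is only a sketch under the same assumptions as before (columns 1 to 4, scale. = TRUE); var_pc_cor is a name I made up for the correlation table. Note that the sign of a principal component is arbitrary, so the signs in a whole column may be flipped on another platform without changing the interpretation.

```r
df  <- iris[, 1:4]
pca <- prcomp(df, scale. = TRUE)  # scaling is an assumption, see above

# pca$x holds the component scores, one column per principal component.
# Correlating the original variables with the scores gives a table with
# variables in rows and PC1..PC4 in columns.
var_pc_cor <- cor(df, pca$x)
round(var_pc_cor, 2)

# PCA always yields as many components as there are input variables.
ncol(pca$x) == ncol(df)

# "Reporting" fewer components is just a matter of keeping fewer columns,
# for example using only the PC1 scores as a single composite variable.
pc1_scores <- pca$x[, "PC1"]
head(pc1_scores)
```

For the original question, the PC1 scores are the natural candidate for a single summary variable when the first component explains most of the variance.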