I am writing a function that performs PCA on pairs of variables in an xts object until the correlation between all of the variables is less than 0.1. Here is the function that I wrote:
PCA_Selection <- function(X, r=0.1){
M <- cor(X) # Creating the correlation matrix
M[M==1] <- 0 # filling the diagonal with 0s so that pairs of the same variable are not considered
while(max(abs(M)) > r){
M <- cor(X)
PCA_vars <- matrix(,nrow = (nrow(M))^2 ,ncol = 2)
for(i in 1:ncol(M)){ # Selects the variables that will be used for PCA
for(j in 1:nrow(M)){
if(M[j,i] > r & M[j,i] < 1){
PCA_vars[c(i*j),] <- c(row.names(M)[i],colnames(M)[j])
}}} # works
PCA_vars <- na.omit(PCA_vars) # works
for (i in 1:nrow(PCA_vars)) {
PCA_pre <- prcomp(X[,c(names(X) %in% PCA_vars[i,])])
Sum_PCA <- summary(PCA_pre)
tmp <- data.frame()
if (Sum_PCA[["importance"]][2,1] > 0.95){ # if the first component captures 95% of variance
tmp <- data.frame(predict(PCA_pre, X)[,1]) # then only use the first component for predictions
names(tmp) <- c(paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_1"))
}else { # else use both components and do not reduce the dimensions
tmp <- predict(PCA_pre,X)
colnames(tmp) <- c(paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_1"),
paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_2"))
}
Xnew <- cbind(X,tmp)
X <- Xnew
}
PCA_vars <- unique(as.vector(PCA_vars)) # Variables to be removed
X <- X[, -which(colnames(X) %in% PCA_vars)]
M <- cor(X)
M[M==1] <- 0
}
return(Xnew)
}
However, when I run the function, R returns a strange error:
Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
The data that I am testing the function with is an xts object that does not have any missing observations. Furthermore, all of the variables have non-zero variance and there are only continuous numeric variables in the data.
The error occurs at line 15:
PCA_pre <- prcomp(X[,c(names(X) %in% PCA_vars[i,])])
Actually, this works on the first run, when i=1. But it fails on the second run when i=2 for the following reason.
On line 27 you modify X by assigning Xnew to it:
X <- Xnew
and Xnew is created on line 26:
Xnew <- cbind(X,tmp)
which I can't quite get my head around. Anyway, tmp is assigned on line 19 (if the first principal component captures more than 0.95 of the total variance) or on line 22 (if it doesn't). This also befuddles me, because on line 19 tmp will have class "data.frame", while on line 22 it will have class "matrix". This is important later, when you create the Xnew object on line 26 (see above). If tmp is a data frame, then Xnew will be a "matrix", which has no names attribute.
And this is why you get an error on line 15 (see above); the prcomp function is attempting to run a PCA on an empty set.
I think the solution may be to not use the data.frame() function on line 19.
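To make the point above concrete, here is a minimal sketch in base R only. It uses a toy numeric matrix as a stand-in for the xts object (an assumption, since I have not reproduced this with the original data), and the column names a, b, c, d are made up for illustration. It shows the two classes that tmp can have, and why selecting columns with names() picks nothing once the object is a plain matrix.

```r
# Toy numeric matrix standing in for the xts object (assumption: base R only).
set.seed(1)
X <- matrix(rnorm(40), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))

pca <- prcomp(X[, c("a", "b")])

# As on line 19: wrapping the first component in data.frame()
tmp_df <- data.frame(predict(pca, X)[, 1])

# Possible alternative: keep a one-column matrix by using drop = FALSE
tmp_mat <- predict(pca, X)[, 1, drop = FALSE]
colnames(tmp_mat) <- "Com_a_b_1"

is.data.frame(tmp_df)   # TRUE
is.matrix(tmp_mat)      # TRUE

# A plain matrix has column names but no names attribute,
# so a selection built from names() has length zero.
M <- cbind(X, tmp_mat)
sel_names    <- names(M) %in% c("c", "d")     # logical(0): selects no columns
sel_colnames <- colnames(M) %in% c("c", "d")  # one TRUE/FALSE per column
```

If the object can ever be a plain matrix, building the selection on line 15 from colnames(X) rather than names(X) may be the safer choice, since colnames() works for both matrices and data frames.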
I tested this change on a sample "xts" dataset. There is no longer an error, but the function seems to run forever.
And as an aside, line 17 could be omitted as it doesn't seem to do anything.