I am writing a function that performs PCA on pairs of variables in an xts object until the correlation between all of the variables is less than 0.1. Here is the function that I wrote:
PCA_Selection <- function(X, r=0.1){
M <- cor(X) # Creating corrolation matrix
M[M==1] <- 0 # filling the diagnal with 0s so that pairs of the same variables are not considered
while(max(abs(M)) > r){
M <- cor(X)
PCA_vars <- matrix(,nrow = (nrow(M))^2 ,ncol = 2)
for(i in 1:ncol(M)){ # Selects variables that will be use for PCA
for(j in 1:nrow(M)){
if(M[j,i] > r & M[j,i] < 1){
PCA_vars[c(i*j),] <- c(row.names(M)[i],colnames(M)[j])
}}} # works
PCA_vars <- na.omit(PCA_vars) # works
for (i in 1:nrow(PCA_vars)) {
PCA_pre <- prcomp(X[,c(names(X) %in% PCA_vars[i,])])
Sum_PCA <- summary(PCA_pre)
tmp <- data.frame()
if (Sum_PCA[["importance"]][2,1] > 0.95){ # if the first component captures 95% of variance
tmp <- data.frame(predict(PCA_pre, X)[,1]) # then only use the first component for predictions
names(tmp) <- c(paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_1"))
}else { # else use all both of the component and do not reduce the dimensions
tmp <- predict(PCA_pre,X)
colnames(tmp) <- c(paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_1"),
paste0("Com_",PCA_vars[i,1],"_",PCA_vars[i,2],"_2"))
}
Xnew <- cbind(X,tmp)
X <- Xnew
}
PCA_vars <- unique(as.vector(PCA_vars)) # Variables to be removed
X <- X[, -which(colnames(X) %in% PCA_vars)]
M <- cor(X)
M[M==1] <- 0
}
return(Xnew)
}
However, when I run the function r returns a strange error:
Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
The data that I am testing the function with is an xts object that does not have any missing observations. Furthermore, all of the variables have non-zero variance and there are only continuous numeric variables in the data.
The error occurs at line 15:
PCA_pre <- prcomp(X[,c(names(X) %in% PCA_vars[i,])])Actually, this works on the first run, when i=1. But it fails on the second run when i=2 for the following reason.
On line 27 you modify the
Xby assigning it toXnew:which is created on line 26:
which I can't quite get my head around. Anyway,
tmpis assigned on line 19 (if the principal component captures > 0.95 of the total variance) or on line 22 (if it doesn't).This also befuddles me because on line 19
tmpwill have a "data.frame" class while on line 22 it will have class "matrix". This is important later when you create theXnewobject on line 26 (see above). Iftmpis a data frame, thenXnewwill be a "matrix", which has no names attribute:And this is why you get an error on line 15 (see above); the
prcompfunction is attempting to run a PCA on an empty set.I think the solution may be to not use the data.frame() function on line 19.
I tested this on a sample "xts" dataset but it seems to run forever. But at least there is no error.
And as an aside, line 17 could be omitted as it doesn't seem to do anything.