PCA - All variables with the same sign on PC1 coordinates


I am running a PCA on a dataset of 160 observations and 20 variables. The observations are patients affected by a disease, and the variables are antibody levels measured in the same experiment, all in the same units (u/mL). Since every variable takes only positive values, I can't understand how some samples end up on the positive side of PC1 when no variable points toward that side.

As potential confounding factors I have the patients' age, gender and duration of infection, but these three were not included in the PCA.

What I am having trouble understanding is the following: when I use fviz_pca_biplot() from the R package factoextra to see both the sample distribution and each variable's contribution to PCs 1 and 2, all 20 variables have strongly negative coordinates on PC1.

I generated the following images from a small sample of my original data. Even though the variable contributions are not the same as in the full dataset, they are still strongly negative on PC1. This is understandable if I do not center the data in prcomp() (image 1): all of my samples then lie on the negative side of PC1, and that component explains most of the data inertia.

library(factoextra)

PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = TRUE, row.names = 1)
res.pca <- prcomp(PCAf, scale = TRUE, center = FALSE)

fviz_pca_biplot(res.pca)

[Image 1: biplot of the non-centered PCA]
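To illustrate the non-centered case, here is a minimal sketch on simulated data (not the dataset from this question; `X` and the `ab1`..`ab5` column names are made up). It assumes strictly positive, log-normally distributed values and checks two things: that PC1 of a non-centered PCA roughly follows the direction of the column means, and that all samples fall on one side of it.

```r
set.seed(1)
# Simulated stand-in for the antibody table: 40 samples x 5 strictly positive variables
X <- matrix(rlnorm(40 * 5, meanlog = 3, sdlog = 0.4), nrow = 40)
colnames(X) <- paste0("ab", 1:5)

pca_nc <- prcomp(X, scale. = TRUE, center = FALSE)

# With center = FALSE, each column is scaled by its root mean square
X_scaled <- scale(X, center = FALSE, scale = TRUE)
mean_dir <- colMeans(X_scaled) / sqrt(sum(colMeans(X_scaled)^2))

# Cosine between PC1 and the mean direction (absolute value: the sign of a PC is arbitrary)
cos_pc1_mean <- abs(sum(pca_nc$rotation[, 1] * mean_dir))

# Do all samples sit on the same side of PC1?
same_sign <- all(pca_nc$x[, 1] > 0) || all(pca_nc$x[, 1] < 0)

print(cos_pc1_mean)
print(same_sign)
```

If the cosine is close to 1, PC1 is mostly describing "how far the data are from the origin" rather than how samples differ from each other, which is consistent with what image 1 shows.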

However, I have been taught that it is necessary to center the data when performing PCA and the image becomes like this:

res.pca <- prcomp(PCAf, scale = TRUE)

fviz_pca_biplot(res.pca)

[Image 2: biplot of the centered PCA]

Centering reduces the variance explained by PC1 and increases that of PC2. However, even though the variables' coordinates change, none of them has a positive coordinate on PC1.

res.var <- get_pca_var(res.pca)
res.var$coord
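For reference, these coordinates can be reproduced in base R. The sketch below uses simulated data (`X` and `ab1`..`ab4` are made-up names) and assumes, as far as I understand factoextra's behaviour for prcomp objects, that `coord` is each loading multiplied by the standard deviation of its component. For a centered and scaled PCA that quantity equals the correlation between each variable and each PC score, which the code checks.

```r
set.seed(2)
# Simulated positive data standing in for the antibody table
X <- matrix(rlnorm(30 * 4, meanlog = 3, sdlog = 0.5), nrow = 30)
colnames(X) <- paste0("ab", 1:4)

pca_c <- prcomp(X, scale. = TRUE)  # center = TRUE is the default

# Loadings multiplied column-wise by each component's standard deviation
coord_manual <- sweep(pca_c$rotation, 2, pca_c$sdev, `*`)

# Correlation between each original variable and each PC score
cor_var_pc <- cor(X, pca_c$x)

# Largest absolute difference between the two matrices
max_diff <- max(abs(coord_manual - cor_var_pc))
print(max_diff)
```

Read this way, a column of same-signed coordinates on PC1 only says that every variable is correlated with PC1 in the same direction.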

These are the values for the non-centered PCA: [image: non-centered coordinates]. And for the centered PCA: [image: centered coordinates].

Am I doing something wrong? Should I really present my analysis with the second image even though the vectors do not seem to match the sample positions?

My main question is: when presenting a PCA, it is better to use the centered data, right? If so, should I apply some sort of correction to the variables' coordinates/contributions to the PCs? The second image does not seem very reliable to me, but that may be due to my lack of experience. Since all variable vectors point toward the left side of the plot, what is pulling some of the samples (e.g. 7, 10, 8, 4, 20) toward the right side (positive PC1)? It seems counterintuitive that there isn't a single vector on the right side.
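One way to probe this question is a minimal sketch on simulated data (not the dataset above; `level`, `X` and `ab1`..`ab6` are made-up names, and the shared "overall level" factor is an assumption used to make all variables positively correlated). After centering, each score is a weighted sum of deviations from the column means, so scores average to zero and must fall on both sides of PC1 even when every loading has the same sign. The sign of a component is also arbitrary: flipping loadings and scores together describes the same PCA.

```r
set.seed(3)
# A shared "overall antibody level" makes all variables positively correlated
level <- rlnorm(50, meanlog = 3, sdlog = 0.5)
X <- sapply(1:6, function(j) level * rlnorm(50, meanlog = 0, sdlog = 0.2))
colnames(X) <- paste0("ab", 1:6)

pca_c <- prcomp(X, scale. = TRUE)
load1 <- pca_c$rotation[, 1]
scores1 <- pca_c$x[, 1]

# Scores are the centered, scaled data multiplied by the loadings
Z <- scale(X)
recomputed <- as.vector(Z %*% load1)

# Centered scores average to zero, so both signs occur
both_sides <- any(scores1 > 0) && any(scores1 < 0)

# Relate PC1 scores to each sample's average standardized level,
# oriented so that "same side as the loadings" counts as positive
overall <- rowMeans(Z)
cor_overall <- cor(overall, scores1) * sign(sum(load1))

# Flipping the sign of loadings and scores together is equally valid
flipped_loadings <- -load1
flipped_scores <- -scores1

print(both_sides)
print(cor_overall)
```

In this simulation, samples on the side opposite to the vectors are simply those with below-average values across the variables; nothing needs to "pull" them there.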

This also raises another question: should confounding factors be included when performing a PCA? I used linear regression to account for them but did not include them in the PCA itself.
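For concreteness, one possible way to combine the two steps (a sketch, not necessarily the right choice for this study) is to regress each variable on the covariates and run the PCA on the residuals. Everything below is simulated: `covars`, `ab` and the effect sizes are hypothetical stand-ins for age, gender, infection duration and the antibody levels.

```r
set.seed(4)
n <- 60
# Hypothetical covariates standing in for age, gender and infection duration
covars <- data.frame(
  age = round(runif(n, 20, 80)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  duration = runif(n, 1, 24)
)

# Simulated log antibody levels that partly depend on age
ab <- sapply(1:5, function(j) 3 + 0.02 * covars$age + rnorm(n, sd = 0.3))
colnames(ab) <- paste0("ab", 1:5)

# lm() accepts a matrix response and fits one regression per column
fit <- lm(ab ~ age + gender + duration, data = covars)
ab_resid <- residuals(fit)

# PCA on what is left after removing the covariate effects
pca_adj <- prcomp(ab_resid, scale. = TRUE)

# Residuals are uncorrelated with the covariates by construction
max_cor_age <- max(abs(cor(ab_resid, covars$age)))
print(max_cor_age)
```

An alternative that leaves the PCA untouched is to colour the samples in the biplot by each covariate and check visually whether they separate along a component.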

Anyway, thank you all so much in advance.

PS: I uploaded a file containing a sample of my data, the code and the images to GitHub.

PS2: When I plot a generic dataset, I do not see the same issue. It appears without centering, but once the data are centered there are vectors in all four quadrants, which I am able to interpret.

data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  
  data.matrix[i,] <- c(wt.values, ko.values)
}
PCAf <- t(data.matrix)

res.pca_NC <- prcomp(PCAf, scale = TRUE, center = FALSE)
res.pca_C <- prcomp(PCAf, scale = TRUE, center = TRUE)

fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)

Not centered - generic PCA: [image: biplot of the non-centered PCA on the generic data]

Centered - generic PCA: [image: biplot of the centered PCA on the generic data]
