I have to perform a PCA on a high-dimensional dataset with the infrared spectra of different wines and then plot it in 2D. I have to color the red wines in red and the white wines in turquoise on the plot.
This is the code I came up with:
library(ggplot2)
wine_pca <- prcomp(data[,-c(1:9)]) # drop columns 1-9, which contain other non-numeric information
pc <- as.data.frame(predict(wine_pca)) # scores as a data frame, so ggplot() can use them
pc1 <- pc[,1]
pc2 <- pc[,2]
#plot principal components pc1 & pc2
ggplot(pc, aes(PC1, PC2)) + theme_bw() +
geom_point(aes(shape = data$name, color = data$color), show.legend = TRUE, size = 3) +
scale_shape_manual(values = c(3, 4, 8, 21, 22, 23, 24, 25)) +
scale_color_manual(guide = "none", values = c("red", "turquoise")) +
theme(legend.position = 'right', legend.title = element_blank()) +
xlab("First Principal Component") +
ylab("Second Principal Component") +
ggtitle("First Two Principal Components of a Selection of Wines")
I thought it looked good and ran fine, but the feedback I got from my professor was:
"Why did you rescale the data for pca? This does not make sense in this case (otherwise please explain) and leads to different results"
As I am a doofus, I don't really understand the feedback - where did I scale the data? Is my approach fundamentally wrong? I would be mighty grateful if one of you whiz kids could help a pretty hopeless girl out. Thanks!
Since your data is not included in the question, here is something you can try. The basic idea of scaling is to transform all the variables to a dimensionless scale so that they can be compared. Try this code at the beginning and compare with your previous results:
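A minimal sketch of such a comparison, using a simulated numeric matrix in place of your spectra (the name `spectra`, the column names and the simulated values are assumptions for illustration, not your data). Note that `prcomp()` defaults to `center = TRUE, scale. = FALSE`, so a plain `prcomp(x)` call centers the columns but does not rescale them; rescaling only happens if you pass `scale. = TRUE` or apply `scale()` to the data beforehand.

```r
# Simulated stand-in for the numeric spectral columns (hypothetical data)
set.seed(1)
spectra <- matrix(rnorm(20 * 5), nrow = 20)
spectra[, 1] <- spectra[, 1] * 100      # one variable with a much larger variance
colnames(spectra) <- paste0("wl", 1:5)  # made-up column names

# Default call: columns are centered but NOT rescaled
pca_unscaled <- prcomp(spectra)

# Rescaled version: every column is standardized to unit variance first
# (gives the same scores, up to sign, as prcomp(spectra, scale. = TRUE))
pca_scaled <- prcomp(scale(spectra))

# Proportion of variance explained by each component in the two fits
summary(pca_unscaled)$importance[2, ]
summary(pca_scaled)$importance[2, ]

# Scores on the first two components, which is what ends up in the 2D plot
head(pca_unscaled$x[, 1:2])
head(pca_scaled$x[, 1:2])
```

If the two sets of scores differ noticeably, scaling matters for your data. Infrared spectra are usually recorded in the same units across all wavelengths, which may be why your professor considers rescaling unnecessary here.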