Assume that we have the following data:
set.seed(123)
Xmat <- matrix(rnorm(2000), ncol = 200, nrow = 10)
Ymat <- Xmat[,1:6]
Here Ymat is subsampled from Xmat (the first 6 of its 200 columns).
Running protest(), we get a high correlation between the two matrices:
library(vegan)
pss <- protest(Xmat, Ymat, permutations=0)
pss
Procrustes Sum of Squares (m12 squared): 0.4318
Correlation in a symmetric Procrustes rotation: 0.7538
Significance: 1
Permutation: free
Number of permutations: 0
Why is the correlation so high, considering the large difference in the number of variables? Does the fact that Ymat is subsampled from Xmat have such a strong influence on the correlation?
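One way to probe this might be to compare the subsample against unrelated random matrices of the same size. The sketch below uses a hand-rolled helper, proc_cor() (a hypothetical name, not part of vegan), which centers both matrices, scales them to unit sum of squares, and sums the singular values of their cross-product. That is my understanding of the statistic behind the symmetric Procrustes correlation, so treat it as an assumption rather than a drop-in replacement for protest(). The 200 reference matrices are an arbitrary choice.

```r
# Hypothetical helper (base R only): symmetric Procrustes correlation,
# assuming it equals the sum of singular values of t(X) %*% Y after both
# matrices are centered and scaled to unit sum of squares.
proc_cor <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  X <- X / sqrt(sum(X^2))
  Y <- Y / sqrt(sum(Y^2))
  sum(svd(crossprod(X, Y))$d)
}

set.seed(123)
Xmat <- matrix(rnorm(2000), ncol = 200, nrow = 10)
Ymat <- Xmat[, 1:6]

# Correlation between the full matrix and its 6-column subsample
r_sub <- proc_cor(Xmat, Ymat)

# Reference values: independent random 10 x 6 matrices that share
# nothing with Xmat except their dimensions
r_null <- replicate(200, proc_cor(Xmat, matrix(rnorm(60), nrow = 10, ncol = 6)))

r_sub
summary(r_null)
```

If r_sub sits well above the range of r_null, the subsampling itself is contributing; if the two are similar, the high value may mostly reflect having only 10 rows against many columns.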
This is something I found with my own data using LaSEC() from the package LaMBDA. The aim of the function is to assess "how many landmarks are enough to characterize shape and size variation", as per the author's paper.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0198341
With my own 2D landmark data, I found that the centroid size correlation can be as high as 98% when the centroid size of the subsampled data set is derived from only 3 landmarks, compared with the 100 landmarks of the parent data set. I was also surprised to see that a subsample of 20 landmarks out of 100 had a correlation of 90% with the original 100 landmarks!
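To make the centroid size comparison concrete, here is a minimal sketch of what I understand it to be. Centroid size is taken as the square root of the summed squared distances of the landmarks from their centroid. The data are simulated stand-ins (a common template rescaled per specimen, plus noise), not my real landmarks. The helper centroid_size() and all parameter values are hypothetical, and I do not know whether LaSEC() computes the comparison exactly this way.

```r
# Centroid size of one configuration (k landmarks x 2 coordinates):
# square root of the summed squared distances from the centroid.
centroid_size <- function(cfg) {
  centered <- sweep(cfg, 2, colMeans(cfg))
  sqrt(sum(centered^2))
}

set.seed(1)
n_spec <- 30   # number of simulated specimens (arbitrary)
n_lm   <- 100  # landmarks per specimen

# Simulated data: one template shape, rescaled by a per-specimen size
# factor, with small independent noise added to every landmark.
template  <- matrix(rnorm(n_lm * 2), ncol = 2)
sizes     <- runif(n_spec, min = 0.5, max = 2)
specimens <- lapply(sizes, function(s) {
  s * template + matrix(rnorm(n_lm * 2, sd = 0.05), ncol = 2)
})

# Centroid size from all 100 landmarks versus from the first 3 only
cs_full <- sapply(specimens, centroid_size)
cs_sub  <- sapply(specimens, function(cfg) centroid_size(cfg[1:3, , drop = FALSE]))

cor(cs_full, cs_sub)
```

In this setup every landmark shares the same per-specimen size factor, which is one possible reason a handful of landmarks could track overall size closely.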