vegan protest() finds a high correlation between significantly different data sets?


Assume that we have the following data:

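set.seed(123)
# 10 rows (observations) x 200 columns (variables) of standard normal noise
Xmat <- matrix(rnorm(2000), ncol = 200, nrow = 10)
# Keep only the first 6 columns of Xmat
Ymat <- Xmat[, 1:6]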

Here, Ymat is a subsample of Xmat (its first 6 columns).

Running protest(), we get a high correlation between the two matrices:

library(vegan)
pss <- protest(Xmat, Ymat, permutations = 0)
pss


Procrustes Sum of Squares (m12 squared):        0.4318 
Correlation in a symmetric Procrustes rotation: 0.7538 
Significance:  1 

Permutation: free
Number of permutations: 0

Why is the correlation so high, considering the large difference in the number of variables? Does the fact that Ymat is subsampled from Xmat have such a strong influence on the significance of the correlation?
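One way to probe this would be to run the same comparison against a matrix that shares nothing with Xmat, and to actually request permutations so that the significance is estimated. This is only a sketch: Zmat and the choice of 999 permutations are assumptions for illustration and are not part of the data above.

```r
library(vegan)

set.seed(123)
Xmat <- matrix(rnorm(2000), ncol = 200, nrow = 10)
Ymat <- Xmat[, 1:6]                             # subsample of Xmat
Zmat <- matrix(rnorm(60), ncol = 6, nrow = 10)  # hypothetical: independent of Xmat

# Request permutations so that a p-value is actually estimated
pss_sub <- protest(Xmat, Ymat, permutations = 999)
pss_ind <- protest(Xmat, Zmat, permutations = 999)

# t0 is the Procrustes correlation, signif the permutation p-value
c(subsample = pss_sub$t0, independent = pss_ind$t0)
c(subsample = pss_sub$signif, independent = pss_ind$signif)
```

If the independent matrix also gives a sizeable correlation, the small number of rows (10) would be a plausible contributor, since protest() reports the correlation for the best-fitting rotation.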

This is something I found with my own data using LaSEC() from the package LaMBDA. The aim of the function is to assess "how many landmarks are enough to characterize shape and size variation", as per the author's paper.

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0198341

I found that with my own 2D landmark data, centroid size correlation can be as high as 98% when the centroid size of the subsampled data set has been derived from only 3 landmarks, compared to the 100 landmarks of the parent data set. I was also surprised to see that 20 landmarks out of 100 had a correlation of 90% with the original 100 landmarks!
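To make the centroid size observation concrete, here is a small simulation in base R. It is not my real data and it does not use LaSEC(): the circular base shape, the number of specimens, the noise level, and the three chosen landmarks are all assumptions. Centroid size is computed as the square root of the summed squared distances of the landmarks from their centroid.

```r
set.seed(1)

n_spec <- 30   # hypothetical number of specimens
n_lm   <- 100  # landmarks per specimen

# Hypothetical base shape: landmarks evenly spaced on a unit circle
theta <- seq(0, 2 * pi, length.out = n_lm + 1)[-(n_lm + 1)]
base  <- cbind(cos(theta), sin(theta))

# Each specimen is the base shape times its own size, plus small noise
sizes <- runif(n_spec, min = 1, max = 2)
specimens <- lapply(sizes, function(s) {
  s * base + matrix(rnorm(2 * n_lm, sd = 0.02), ncol = 2)
})

# Centroid size: sqrt of the summed squared distances to the centroid
centroid_size <- function(coords) {
  centered <- scale(coords, center = TRUE, scale = FALSE)
  sqrt(sum(centered^2))
}

cs_full <- sapply(specimens, centroid_size)
# Same measure from only 3 well-spread landmarks
cs_sub  <- sapply(specimens, function(coords) centroid_size(coords[c(1, 34, 67), ]))

cor(cs_full, cs_sub)
```

In this setup the specimens differ almost only in overall scale, so even 3 landmarks track the size of the full configuration closely. Whether that is what happens in my real data is exactly what I am unsure about.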
