mclapply() instead of nested for loops


I have a huge matrix and would like to calculate the distance matrix by pairwise comparison of all columns. I know that the built-in dist() function in R already does this, but I would like to do it manually using nested for loops. I have written the following code, which returns the matrix as desired.

The issue is that my real data is huge, and the nested for loops take forever to compute the distance matrix.

mtx <- matrix(rnorm(20), nrow = 4, ncol = 5)
mtx.out <- matrix(0, nrow = 5,  ncol = 5)

for(i in 1:ncol(mtx)){
  for(j in 1:ncol(mtx)){
    mtx.out[i,j] <- sqrt(sum((mtx[[i]] - mtx[[j]]) ^ 2))
  }
}

I was wondering how I can use lapply or mclapply and specify the number of cores on my computer to make the process faster.

Any help is much appreciated!

There is 1 answer

Answered by dcarlson

If you really need to compute the distances directly, this will be faster than the double loop, partly because it does not compute the distances twice (1 vs 2 and 2 vs 1) and it does not compute distances of a column with itself (1 vs 1):

set.seed(42)    # Make the example reproducible
mtx <- matrix(rnorm(20), nrow = 4, ncol = 5)

dmat <- matrix(0, 5, 5)
dsts <- apply(combn(1:5, 2), 2, function(x) sqrt(sum((mtx[, x[1]] - mtx[, x[2]])^2)))
dmat[lower.tri(dmat)] <- dsts   # combn() order matches the column-major fill of the lower triangle
dmat <- dmat + t(dmat)          # mirror into the upper triangle; the diagonal stays 0
dmat
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.729993 2.071973 2.818655 3.930348
# [2,] 1.729993 0.000000 2.884562 2.546453 4.959642
# [3,] 2.071973 2.884562 0.000000 4.055794 5.195339
# [4,] 2.818655 2.546453 4.055794 0.000000 3.558716
# [5,] 3.930348 4.959642 5.195339 3.558716 0.000000
# 
dist(t(mtx), diag=TRUE, upper=TRUE)   # Using the dist() function
#          1        2        3        4        5
# 1 0.000000 1.729993 2.071973 2.818655 3.930348
# 2 1.729993 0.000000 2.884562 2.546453 4.959642
# 3 2.071973 2.884562 0.000000 4.055794 5.195339
# 4 2.818655 2.546453 4.055794 0.000000 3.558716
# 5 3.930348 4.959642 5.195339 3.558716 0.000000
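
Since the question asks specifically about mclapply(), here is a minimal sketch of one way to spread the work over cores with parallel::mclapply(). It is only an illustration under stated assumptions: the core count of 2 and the names n.cores, rows and dmat.par are made up for the example. Each task handles one column and computes its distances to all columns in a single vectorized pass, so every pair is computed twice, trading some redundant arithmetic for fewer, larger tasks. mclapply() relies on forking, which is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

set.seed(42)
mtx <- matrix(rnorm(20), nrow = 4, ncol = 5)
n <- ncol(mtx)

# Assumed core count for illustration; mc.cores > 1 is not supported on Windows
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per column: subtract column i from every column (recycled
# column-wise), then take the root of the column sums of squares
rows <- mclapply(seq_len(n), function(i) {
  sqrt(colSums((mtx - mtx[, i])^2))
}, mc.cores = n.cores)

dmat.par <- do.call(rbind, rows)   # n x n distance matrix

# Compare against dist(), ignoring the dimnames that as.matrix() adds
all.equal(dmat.par, as.matrix(dist(t(mtx))), check.attributes = FALSE)
```

On real data it is worth timing this against plain dist(), since the forking overhead can outweigh the gain when each column is cheap to process.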

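In contrast, your code produces the following, because mtx[[i]] extracts the single i-th element of the matrix rather than column i (which would be mtx[, i]):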

mtx.out
#           [,1]      [,2]       [,3]      [,4]       [,5]
# [1,] 0.0000000 1.9356566 1.00783004 0.7380958 0.96669012
# [2,] 1.9356566 0.0000000 0.92782658 1.1975608 0.96896649
# [3,] 1.0078300 0.9278266 0.00000000 0.2697342 0.04113991
# [4,] 0.7380958 1.1975608 0.26973419 0.0000000 0.22859428
# [5,] 0.9666901 0.9689665 0.04113991 0.2285943 0.00000000
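
For reference, a minimal sketch of the original double loop with column indexing (mtx[, i] in place of mtx[[i]]), which should agree with dist(). The names n and mtx.fix are illustrative and not from the question.

```r
set.seed(42)
mtx <- matrix(rnorm(20), nrow = 4, ncol = 5)
n <- ncol(mtx)
mtx.fix <- matrix(0, nrow = n, ncol = n)

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # mtx[, i] selects column i; mtx[[i]] would select only the i-th element
    mtx.fix[i, j] <- sqrt(sum((mtx[, i] - mtx[, j])^2))
  }
}

# Compare against dist(), ignoring the dimnames that as.matrix() adds
all.equal(mtx.fix, as.matrix(dist(t(mtx))), check.attributes = FALSE)
```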