Simple matching similarity matrix for continuous, non-binary data?

Question

Simple matching similarity matrix for continuous, non-binary data?

2k views Asked by Werner Hertzog At 24 May 2012 at 03:49

Given the matrix

structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L, 
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L, 
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1", 
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA, 
-6L))

I want to create a 5 x 5 distance matrix with the ratio of matches and the total number of rows between all columns. For instance, the distance between X4 and X3 should be 0.5, given that both columns match 3 out of 6 times.

I have tried using dist(test, method="simple matching") from package "proxy" but this method only works for binary data.

Original Q&A

There are 5 answers

Jesse Anderson On 24 May 2012 at 04:11

Here's a shot at it (dt is your matrix):

library(reshape)
df = expand.grid(names(dt),names(dt))
df$val=apply(df,1,function(x) mean(dt[x[1]]==dt[x[2]]))
cast(df,Var2~Var1)

Chase On 24 May 2012 at 04:35

Here's a solution that is faster than the other two, though a bit ugly. I assume the speed bumps come from not using mean() as it can be slow compared to sum(), and also only computing half of the output matrix and then filling the lower triangle manually. The function currently leaves NA on the diagonal, but you can easily set those to one to completely match the other answers with diag(out) <- 1

FUN <- function(m) {
  #compute all the combinations of columns pairs
  combos <- t(combn(ncol(m),2))
  #compute the similarity index based on the criteria defined
  sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
  combos <- cbind(combos, sim)
  #dimensions of output matrix
  out <- matrix(NA, ncol = ncol(m), nrow = ncol(m))

  for (i in 1:nrow(combos)){
    #upper tri
    out[combos[i, 1], combos[i, 2]] <- combos[i,3]
    #lower tri
    out[combos[i, 2], combos[i, 1]] <- combos[i,3]
  }
  return(out)
}

I took the other two answers, made them into functions, and did some benchmarking:

library(rbenchmark)
benchmark(chase(m), flodel(m), blindJessie(m), 
          replications = 1000,
          order = "elapsed", 
          columns = c("test", "elapsed", "relative"))
#-----
       test elapsed relative
1  chase(m)   1.217 1.000000
2 flodel(m)   1.306 1.073131
3 blindJessie(m)  17.691 14.548520

Werner Hertzog On 25 May 2012 at 02:49

Thank you all for your suggestions. Based on your answers I elaborated a three line solution ("test" is the name of the dataset).

require(proxy)
ff <- function(x,y) sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)

Here is the output:

          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000

Dilnessa G. On 18 February 2017 at 14:38

I have got the answer as follows: 1st I have made some modifications on the row data as:

X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
matrix_cor=rbind(x1,x2,x3,x4,x5)
matrix_cor

   [,1] [,2] [,3] [,4] [,5] [,6]
X1    1    2    3    4    2    5
X2    2    3    4    5    3    6
X3    3    4    4    5    3    2
X4    2    4    6    5    3    8
X5    1    3    2    4    6    4

then:

dist(matrix_cor)

     X1       X2       X3       X4
X2 2.449490                           
X3 4.472136 4.242641                  
X4 5.000000 3.000000 6.403124         
X5 4.358899 4.358899 4.795832 6.633250

**flodel** · Accepted Answer · 2012-05-24T04:16:42+00:00

Using outer (again :-)

my.dist <- function(x) {
 n <- nrow(x)
 d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
            Vectorize(function(i,j)sum(x[[i]] == x[[j]]) / n))
 rownames(d) <- names(x)
 colnames(d) <- names(x)
 return(d)
}

my.dist(x)
#           X1        X2  X3  X4        X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000

TechQA.

Simple matching similarity matrix for continuous, non-binary data?

There are 5 answers

Related Questions in R

Related Questions in DISTANCE

Related Questions in MATCHING

Related Questions in SIMILARITY

Related Questions in METRIC

Popular Questions

Popular Tags

Trending Questions