How to find the best resemblance between one row and the rest of a data frame in R?


How can I find the best resemblance between one particular row and the rest of the rows in a dataframe?

To illustrate what I mean, take a look at this data frame:

df <- structure(list(person = 1:5, var1 = c(1L, 5L, 2L, 2L, 5L), var2 = c(4L, 
4L, 3L, 2L, 2L), var3 = c(5L, 4L, 4L, 3L, 1L)), .Names = c("person", 
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA, 
-5L))

How can I find the best resemblance between person 1 (row 1) and the rest of the rows (persons) in the data frame? The output should look something like this: person 1 stays in row 1, and the remaining rows follow in order of decreasing resemblance. The similarity measure I want to use is cosine or Pearson. I tried to solve my problem with functions from the arules package, but they didn't fit my needs.

Does anyone have any ideas?


There are 2 answers

Answer by Sotos (best answer)

Another idea is to define the cosine function manually and apply it to your data frame, i.e.

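# Cosine similarity between two numeric vectors
f1 <- function(x, y){
  crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))
}

# Similarity of row 1 to each of rows 2..n (person column dropped)
sims <- sapply(2:nrow(df), function(i)
  f1(unlist(df[1, -1]), unlist(df[i, -1])))

# Keep row 1 first, then the remaining rows by decreasing similarity
df[c(1, order(sims, decreasing = TRUE) + 1), ]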

which gives,

   person var1 var2 var3
1      1    1    4    5
3      3    2    3    4
4      4    2    2    3
2      2    5    4    4
5      5    5    2    1
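
The question also mentions Pearson. A minimal sketch of the same ordering using base R's cor() might look like the following; the names m, pearson, ord and df_pearson are only illustrative, and with just three variables per person the correlations are necessarily coarse.

```r
# Same data as in the question, rebuilt so the snippet runs on its own
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# cor() correlates columns, so transpose to put one person per column
m <- t(as.matrix(df[-1]))

# Pearson correlation of person 1 with persons 2..n (a 1 x (n - 1) matrix)
pearson <- cor(m[, 1], m[, -1])

# Positions within 2..n sorted by decreasing correlation; +1 maps them back to row numbers
ord <- order(pearson, decreasing = TRUE) + 1

# Row 1 first, then the others from most to least correlated
df_pearson <- df[c(1, ord), ]
```

Unlike cosine, Pearson centers each person's values first, so it compares the shape of the profile rather than its overall level.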

Answer by LyzandeR

You could try cosine from lsa:

library('lsa') 
cosine(t(df[-1]))
#          [,1]      [,2]      [,3]      [,4]      [,5]
#[1,] 1.0000000 0.8379571 0.9742160 0.9356015 0.5070926
#[2,] 0.8379571 1.0000000 0.9346460 0.9637388 0.8947540
#[3,] 0.9742160 0.9346460 1.0000000 0.9908302 0.6780635
#[4,] 0.9356015 0.9637388 0.9908302 1.0000000 0.7527727
#[5,] 0.5070926 0.8947540 0.6780635 0.7527727 1.0000000

You provide cosine with a matrix where each column represents a person (that's why I use t) and it calculates all the cosine similarities among them.
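
To get from that matrix to the ordering asked for in the question, one possible sketch is to read off the row belonging to person 1 and sort the other persons by it. To keep the snippet free of package dependencies, the similarity matrix is rebuilt here in base R; it should hold the same values as the cosine() output shown above. The names sim, ref, others and df_sorted are assumptions made for illustration.

```r
# Same data as in the question, rebuilt so the snippet runs on its own
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# Pairwise cosine similarity between rows (base R stand-in for cosine(t(df[-1])))
m <- as.matrix(df[-1])
norms <- sqrt(rowSums(m^2))
sim <- tcrossprod(m) / outer(norms, norms)

# Reference row and the rows to be ranked against it
ref <- 1
others <- setdiff(seq_len(nrow(df)), ref)

# Rank the other rows by their similarity to the reference row
ord <- others[order(sim[ref, others], decreasing = TRUE)]

# Reference row first, then the rest from most to least similar
df_sorted <- df[c(ref, ord), ]
```

Changing ref to another row number would rank everyone against that person instead, since the full matrix is already available.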