What is an appropriate method to cluster a binary matrix?


I am a beginner in clustering. I have a binary matrix in which each row is a student and each column indicates whether that student is enrolled in a session. I want to cluster students who are enrolled in the same sessions.

There are many clustering methods, and the right choice depends on the dataset.

For example, k-means is not appropriate because the data is binary, and the standard "mean" operation does not make much sense for binary values.

I'm open to any suggestions.

Here's an example:

+----------+----------+----------+----------+
| session1 | session2 | session3 | session4 |
+----------+----------+----------+----------+
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
+----------+----------+----------+----------+

Result:

clusterA = [user1,user3]

clusterB = [user2,user4]

Best answer, by knb:

You could compute the Jaccard distance between each pair of rows (students) and then cluster on the resulting distance matrix.
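To make the measure concrete, here is a minimal sketch of the Jaccard distance between two binary vectors: one minus the number of sessions both students attend, divided by the number of sessions at least one of them attends. The helper `jaccard_dist` and the `user1`..`user3` vectors are hypothetical names used only for illustration, and treating two all-zero vectors as identical is an assumed convention.

```r
# Hypothetical helper (not part of base R): Jaccard distance between two
# binary vectors of equal length.
jaccard_dist <- function(a, b) {
  a <- as.logical(a)
  b <- as.logical(b)
  union_size <- sum(a | b)  # sessions attended by at least one student
  # Assumed convention: two all-zero vectors count as identical.
  if (union_size == 0) return(0)
  1 - sum(a & b) / union_size  # sum(a & b) = sessions attended by both
}

# Rows taken from the example in the question.
user1 <- c(1, 0, 1, 0)
user2 <- c(0, 1, 0, 1)
user3 <- c(1, 0, 1, 0)

print(jaccard_dist(user1, user3))  # identical enrollments: 0
print(jaccard_dist(user1, user2))  # no shared session: 1
```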

In R:

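# create the data table: one row per student, one column per session
mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE), s2 = c(FALSE, TRUE, FALSE, TRUE),
                  s3 = c(TRUE, FALSE, TRUE, FALSE), s4 = c(FALSE, TRUE, FALSE, TRUE))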

Result:

     s1    s2    s3    s4
1  TRUE FALSE  TRUE FALSE
2 FALSE  TRUE FALSE  TRUE
3  TRUE FALSE  TRUE FALSE
4 FALSE  TRUE FALSE  TRUE

dist(mat, method = "binary")  # method "binary" gives the Jaccard distance

Result:

  1 2 3
2 1    
3 0 1  
4 1 0 1

Row 3 has a distance of 1 from row 4 and a distance of 0 from row 1. The distances happen to be exactly 0 or 1 here, but in general they are real numbers between 0 and 1. (Your toy dataset may be too simple to show this.)
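As an illustration of fractional distances, here is a small sketch with partially overlapping enrollments. The data frame `mat2` is made up for this example and is not from the question.

```r
# Hypothetical enrollment data with partial overlap between students.
mat2 <- data.frame(s1 = c(TRUE, TRUE,  TRUE),
                   s2 = c(TRUE, FALSE, TRUE),
                   s3 = c(FALSE, TRUE, TRUE),
                   s4 = c(FALSE, FALSE, FALSE))

d2 <- dist(mat2, method = "binary")

# By hand: rows 1 and 2 share 1 of 3 attended sessions, so d = 2/3;
# rows 1 and 3, and rows 2 and 3, share 2 of 3, so d = 1/3.
print(round(d2, 3))
```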

Cluster them:

hclust(dist(mat, method="binary"))

Result (not so informative):

Call:
hclust(d = dist(mat, method = "binary"))

Cluster method   : complete 
Distance         : binary 
Number of objects: 4 
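The printed summary does not list which row ended up in which cluster. One possible way to recover the memberships is `cutree()`. The sketch below assumes that two clusters are wanted (`k = 2`) and labels the rows `user1`..`user4` to match the question; both are assumptions rather than part of the original answer.

```r
# Same toy data as above, one row per student.
mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE),
                  s2 = c(FALSE, TRUE, FALSE, TRUE),
                  s3 = c(TRUE, FALSE, TRUE, FALSE),
                  s4 = c(FALSE, TRUE, FALSE, TRUE))

hc <- hclust(dist(mat, method = "binary"))

# Cut the tree into k groups; k = 2 is an assumption for this toy data.
groups <- cutree(hc, k = 2)

# Group hypothetical user labels by cluster id.
clusters <- split(paste0("user", seq_along(groups)), groups)
print(clusters)
```

For data where the number of clusters is not known in advance, cutting at a distance threshold with `cutree(hc, h = ...)` may be a better fit than fixing `k`.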

Create a dendrogram plot:

plot(hclust(dist(mat, method="binary")))

[Figure: dendrogram]