I am a beginner in clustering, and I have a binary matrix in which each row is a student and each column marks whether that student is enrolled in a given session. I want to cluster students who share the same sessions.
There are many clustering methods, and the right choice depends on the dataset.
For example, k-means is not appropriate here, because the data is binary and the standard "mean" operation does not make much sense for binary data.
I'm open to any suggestions.
Here's an example:
+----------+----------+----------+----------+
| session1 | session2 | session3 | session4 |
+----------+----------+----------+----------+
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
+----------+----------+----------+----------+
Expected result:
clusterA = [user1,user3]
clusterB = [user2,user4]
You could use the Jaccard distance for each pair of points.
In R:
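A minimal sketch of this step, assuming base R's `dist()` with `method = "binary"`, which computes the Jaccard distance between rows (the number of sessions only one of the two students attends, divided by the number of sessions at least one of them attends). The matrix is the toy example from the question; the names `m` and `d` are illustrative, and other packages offer equivalent functions.

```r
# Toy enrollment matrix from the question: one row per user, one column per session.
m <- rbind(
  user1 = c(1, 0, 1, 0),
  user2 = c(0, 1, 0, 1),
  user3 = c(1, 0, 1, 0),
  user4 = c(0, 1, 0, 1)
)
colnames(m) <- paste0("session", 1:4)

# method = "binary" treats non-zero entries as "on" and returns the
# Jaccard distance for every pair of rows.
d <- dist(m, method = "binary")
print(d)
# user1/user3 and user2/user4 are identical (distance 0);
# every other pair shares no session (distance 1).
```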
In the resulting distance matrix, row 3 has a distance of 1 from row 4. Here every distance happens to be exactly 0 or 1, but in general Jaccard distances are fractional values between 0 and 1 (your toy dataset may be too simple to show this).
Cluster them:
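One way to do this, sketched under the assumption that hierarchical clustering with base R's `hclust()` is acceptable (it uses complete linkage by default). The data and the Jaccard distances are rebuilt here so the snippet runs on its own; `m`, `d`, and `hc` are illustrative names.

```r
# Toy enrollment matrix from the question.
m <- rbind(
  user1 = c(1, 0, 1, 0),
  user2 = c(0, 1, 0, 1),
  user3 = c(1, 0, 1, 0),
  user4 = c(0, 1, 0, 1)
)

# Jaccard distances between users, then agglomerative clustering on them.
d <- dist(m, method = "binary")
hc <- hclust(d)  # default method is "complete" linkage

# Printing an hclust object only summarises the call, method and size.
print(hc)
```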
The printed result is not very informative, so create a dendrogram plot:
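A sketch of the plotting step using base R's `plot()` method for `hclust` objects. The call to `cutree()` is an addition that was not part of the original steps: it cuts the tree into a chosen number of groups (`k = 2` is an assumption that matches the two clusters expected in the question) so that the memberships can be read off directly.

```r
# Toy enrollment matrix from the question.
m <- rbind(
  user1 = c(1, 0, 1, 0),
  user2 = c(0, 1, 0, 1),
  user3 = c(1, 0, 1, 0),
  user4 = c(0, 1, 0, 1)
)

# Jaccard distances and hierarchical clustering, as in the previous steps.
hc <- hclust(dist(m, method = "binary"))

# Draw the dendrogram; leaves are labelled with the row names of m.
plot(hc, main = "Students clustered by shared sessions")

# Cut the tree into k groups and list the users in each group.
groups <- cutree(hc, k = 2)
clusters <- split(names(groups), groups)
print(clusters)
```

On the toy data this puts user1 and user3 in one group and user2 and user4 in the other. On real data, where distances are not just 0 and 1, cutting at a height with `cutree(hc, h = ...)` may be more natural than fixing `k`.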