How to compare consistency between clustering results and a list of values with different levels in R?


I'm struggling with a problem that may seem easy to solve and may already have been answered in previous questions, but I can't find anything about it online.

I'm doing a clustering analysis on some data (k-means, hierarchical clustering through a heatmap; the exact method doesn't matter here). I want to check whether my clustering (the "Cluster group" column) is consistent with a list of values (the "tropism" column) attached to my individuals. The catch is that this list of values doesn't have the same levels as my clustering results. I would like to run a Fleiss' kappa consistency test on both variables (clustering results vs. list of values). Here is a shortened version of my data frame:

                Cluster group tropism    
JX308829.1 all  "1"           "digestif" 
NC_020890.1 all "1"           "digestif" 
KF954417.1 all  "1"           "peau"     
HM011544.1 all  "2"           "peau"      
MH844627.1 all  "2"           "peau"     
HQ696595.1 all  "2"           "rein"     
AB211390.1 all  "2"           "rein"     
AB301101.1 all  "2"           "rein"     
HM011559.1 all  "2"           "digestif" 
KY404016.1 all  "2"           "rein"      
KF444093.1 all  "3"           "cerveau"    
KJ725028.1 all  "3"           "peau"     
GU296408.1 all  "3"           "peau"     
EU711058.1 all  "3"           "syst_resp"
KC549591.1 all  "4"           "syst_resp"
KR090571.1 all  "4"           "muscle"   
AB081611.1 all  "5"           "muscle"   
AB092581.1 all  "5"           "peau"     
AB127351.2 all  "5"           "digestif"

The problem is that Fleiss' kappa, by design, compares two lists that share the same levels.

I tried to create an algorithm that renames each cluster level after the majority value in it, but that feels a bit like manipulating the data, and I have ties between and within groups, which makes it hard to pick a value for some clusters. So I have several questions:

  1. Why can't I compare the consistency between two variables with different levels? It may be naive, but shouldn't consistency be measured between groups (for example, whether "digestif" is associated with cluster group 1)? Is there an option that I missed in the kappam.fleiss() function?
  2. Is there a function or test that I missed? I apologize if so, but I looked for something as powerful and significant as a Fleiss' kappa test without any success.
  3. Do you think I should manipulate the data as described above? Is that acceptable, even if I have to assign some parts by hand?

Answer by StupidWolf:

You will most likely have to map each cluster to the majority label within it. I had trouble copy-pasting your table, so here is an example using iris:

res = data.frame(clus=kmeans(scale(iris[,1:4]),3)$cluster,labels=iris$Species)

    clus    labels
145    1 virginica
146    1 virginica
147    2 virginica
148    1 virginica
149    1 virginica
150    2 virginica

The last rows of res are shown above; the data now has the same shape as yours. Next, a function that assigns each cluster the majority label within it:

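pred2labels = function(pred, actual){
  # work with character vectors so clusters can be looked up by name
  pred = as.character(pred)
  actual = as.character(actual)

  # contingency table: one row per cluster, one column per label
  tab = as.matrix(table(pred, actual))
  # majority label per cluster; ties.method = "first" keeps ties reproducible
  # (the max.col() default breaks ties at random)
  assignment = colnames(tab)[max.col(tab, ties.method = "first")]
  names(assignment) = rownames(tab)
  # return the assigned label for every observation
  assignment[pred]
}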

res$predicted_label = pred2labels(res$clus,res$labels)

    clus    labels predicted_label
145    1 virginica       virginica
146    1 virginica       virginica
147    2 virginica      versicolor
148    1 virginica       virginica
149    1 virginica       virginica
150    2 virginica      versicolor
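
To see how the same majority rule behaves on the table from the question, here is a minimal sketch with the 19 rows typed in by hand. The variable names (clus, tropism, majority, tied) are mine, and resolving ties by taking the first label in alphabetical order is an arbitrary assumption, not a recommendation:

```r
# The 19 rows from the question, typed in by hand
clus <- c(rep("1", 3), rep("2", 7), rep("3", 4), rep("4", 2), rep("5", 3))
tropism <- c("digestif", "digestif", "peau",
             "peau", "peau", "rein", "rein", "rein", "digestif", "rein",
             "cerveau", "peau", "peau", "syst_resp",
             "syst_resp", "muscle",
             "muscle", "peau", "digestif")

# Plain integer matrix: rows are clusters, columns are tropism values
tab <- unclass(table(clus, tropism))

# Majority label per cluster; ties go to the first column (alphabetical order),
# which is an arbitrary but reproducible choice
majority <- colnames(tab)[max.col(tab, ties.method = "first")]
names(majority) <- rownames(tab)

# Clusters whose top count is shared by more than one label
row_max <- apply(tab, 1, max)
tied <- rownames(tab)[rowSums(tab == row_max) > 1]

print(tab)
print(majority)
print(tied)
```

Clusters 4 and 5 come out as tied here, which is exactly the situation described in the question: whatever label they receive is a choice you make, so it is worth reporting which clusters were tied alongside the kappa.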

Then apply Fleiss' kappa from the irr package to the two label columns:

library(irr)
kappam.fleiss(res[,2:3])
 Fleiss' Kappa for m Raters

 Subjects = 150 
   Raters = 2 
    Kappa = 0.75 

        z = 13 
  p-value = 0
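
To make clear what that number measures, here is a rough sketch of the textbook Fleiss' kappa formula for exactly two label columns, written in base R. fleiss_two() is a hypothetical helper name of mine, not part of irr; I would expect it to agree with kappam.fleiss() on a two-column input, but that is an assumption worth checking on your own data:

```r
# Fleiss' kappa for exactly two label vectors, written out by hand.
# fleiss_two() is a hypothetical helper, not a function from irr.
fleiss_two <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  stopifnot(length(a) == length(b))

  # observed agreement: share of subjects where both labels match
  p_obs <- mean(a == b)

  # chance agreement: squared share of each category, pooled over both columns
  p_cat <- table(c(a, b)) / (2 * length(a))
  p_exp <- sum(p_cat^2)

  (p_obs - p_exp) / (1 - p_exp)
}

# toy input: 3 of 4 subjects agree
a <- c("x", "x", "y", "y")
b <- c("x", "y", "y", "y")
kappa_toy <- fleiss_two(a, b)
print(kappa_toy)
```

The formula only counts exact label matches, which is why the two columns need to share the same levels and why the cluster numbers have to be mapped to labels first.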