I'm struggling with a problem that may seem easy to solve and may already have been answered in previous questions, but I can't find anything online about it.
I'm currently running a clustering analysis on some data (k-means, hierarchical via heatmap; the method doesn't matter here). I want to check whether my clustering (the "Cluster group" column) is consistent with a list of values (the "tropism" column) attached to my individuals. The problem is that this list of values does not have the same levels as my clustering results. I would like to run a Fleiss' kappa agreement test on both variables (clustering results vs. list of values). Here is a shortened version of my data frame:
                 Cluster group  tropism
JX308829.1 all   "1"            "digestif"
NC_020890.1 all  "1"            "digestif"
KF954417.1 all   "1"            "peau"
HM011544.1 all   "2"            "peau"
MH844627.1 all   "2"            "peau"
HQ696595.1 all   "2"            "rein"
AB211390.1 all   "2"            "rein"
AB301101.1 all   "2"            "rein"
HM011559.1 all   "2"            "digestif"
KY404016.1 all   "2"            "rein"
KF444093.1 all   "3"            "cerveau"
KJ725028.1 all   "3"            "peau"
GU296408.1 all   "3"            "peau"
EU711058.1 all   "3"            "syst_resp"
KC549591.1 all   "4"            "syst_resp"
KR090571.1 all   "4"            "muscle"
AB081611.1 all   "5"            "muscle"
AB092581.1 all   "5"            "peau"
AB127351.2 all   "5"            "digestif"
The problem is that Fleiss' kappa, naturally, compares two lists that share the same levels.
I tried writing an algorithm that renames each cluster after the majority value within it, but this feels a bit like manipulating the data, and I have ties both between and within groups, which makes it hard to pick a value for each cluster. So I have several questions:
- Why can't I measure the consistency between two variables with different levels? It may be naive, but shouldn't consistency be measured between groups (for example, whether "digestif" is associated with "cluster group 1")? Is there an option that I missed in the kappam.fleiss() function?
- Is there a function or a test that I missed? I apologize if so, but I looked for something as powerful and as well established as Fleiss' kappa without any success.
- Do you think I should manipulate the data as described above? Is that acceptable, even if some parts have to be relabeled by hand?
Most likely you have to assign each cluster group to the majority label in that group. I had problems copy-pasting your table, so here is an example using iris:
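The original code for this step is not available here, so the following is only a minimal sketch of how such data could be built. The choice of k-means with 3 centers, the seed, and the names `km` and `df` are assumptions made for illustration.

```r
# Assumption: k-means with 3 centers on the four numeric iris columns
set.seed(111)
km <- kmeans(iris[, 1:4], centers = 3)

# One row per individual: cluster id and the known label (Species)
df <- data.frame(cluster = km$cluster, label = iris$Species)

# Cross-tabulate clusters against labels to see which label dominates each cluster
table(df$cluster, df$label)
```

Here `cluster` plays the role of the "Cluster group" column and `label` the role of "tropism".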
We now have data shaped like yours. Next, a function to assign a label to each cluster based on the majority within that cluster:
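A possible version of such a function is sketched below. The helper name `majority_label` is hypothetical, and the tie-breaking rule (the alphabetically first label wins) is an assumption that follows from how `table()` and `which.max()` behave, not something stated in the question.

```r
# Hypothetical helper: relabel each cluster with its most frequent label
majority_label <- function(cluster, label) {
  label <- as.character(label)
  # Most frequent label per cluster; on ties, which.max() keeps the first
  # entry of table(), i.e. the alphabetically first label
  top <- vapply(split(label, cluster),
                function(x) names(which.max(table(x))),
                character(1))
  # Map every individual's cluster id to that cluster's majority label
  unname(top[as.character(cluster)])
}

# Example on iris, assuming k-means with 3 centers
set.seed(111)
df <- data.frame(cluster = kmeans(iris[, 1:4], centers = 3)$cluster,
                 label = iris$Species)
df$assigned <- majority_label(df$cluster, df$label)
head(df)
```

Because ties are resolved silently, it may be worth checking the cross-tabulation of clusters and labels first, especially for small clusters such as the ones in your table.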
Then apply the kappa from irr:
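As a sketch of this last step, assuming the irr package is installed and treating the known labels and the majority-relabeled clusters as two "raters" of the same individuals (the variable names and the k-means setup are again assumptions):

```r
library(irr)  # assumes the irr package is installed

# Rebuild the example: k-means clusters on iris plus the known labels
set.seed(111)
cl <- kmeans(iris[, 1:4], centers = 3)$cluster
truth <- as.character(iris$Species)

# Relabel each cluster with its most frequent species (ties: first alphabetically)
top <- vapply(split(truth, cl),
              function(x) names(which.max(table(x))),
              character(1))
assigned <- unname(top[as.character(cl)])

# Rows are subjects, columns are raters
res <- kappam.fleiss(data.frame(truth = truth, assigned = assigned))
res
```

With only two columns, Cohen's kappa via `kappa2()` from the same package would be an alternative to `kappam.fleiss()`.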