I would like to cluster some data using k-means in R. The data looks as follows.
ADP NS CNTR PP2V EML PP1V ADDPS FB PP1D ADR ISV PP2D ADSEM SUMALL CONV
2 0 0 1 0 0 0 0 0 12 0 12 0 53 0
2 0 0 1 0 0 0 0 0 14 0 25 0 53 0
2 0 0 1 0 0 0 0 0 15 0 0 0 53 0
2 0 0 1 0 0 0 0 0 15 0 4 0 53 0
2 0 0 1 0 0 0 0 0 17 0 0 0 53 0
2 0 0 1 0 0 0 0 0 18 0 0 0 106 0
2 0 0 1 0 0 0 0 0 23 0 10 0 53 0
2 0 0 1 0 0 1 0 0 0 0 1 0 106 0
2 0 0 1 0 0 3 0 0 0 0 0 0 53 0
2 0 0 2 0 0 0 0 0 0 0 0 0 3922 0
2 0 0 2 0 0 0 0 0 0 0 1 0 530 0
2 0 0 2 0 0 0 0 0 0 0 2 0 954 0
2 0 0 2 0 0 0 0 0 0 0 3 0 477 0
2 0 0 2 0 0 0 0 0 0 0 4 0 265 0
2 0 0 2 0 0 0 0 0 0 0 5 0 742 0
2 0 0 2 0 0 0 0 0 0 0 6 0 265 0
2 0 0 2 0 0 0 0 0 0 0 7 0 265 0
The column "SUMALL" is the number of times that a particular combination of variables is observed in the data.
When using k-means, I would like to use this column as a 'weight' for each combination, so that frequent combinations get more importance and the cluster features are reported as weighted averages.
I can't see a simple way to do this in the standard cluster package. Can anyone advise on whether there is one?
Since SUMALL is the number of times a particular observation occurred, you could create a new dataset where each row is replicated the correct number of times, and then do your clustering with that new dataset.
Here's a simple example of expanding the dataset to replicate rows.
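A minimal sketch of a starting data frame, assuming a toy subset of the question's rows (three of the feature columns plus SUMALL). The name df matches the one used in this answer; the choice of rows and columns is only for illustration:

```r
# Toy subset of the question's data (an assumption for illustration):
# three feature columns plus the count column SUMALL
df <- data.frame(
  PP2V   = c(1, 1, 1, 2, 2, 2),
  ADR    = c(12, 14, 18, 0, 0, 0),
  PP2D   = c(12, 25, 0, 1, 2, 4),
  SUMALL = c(53, 53, 106, 530, 954, 265)
)

# Each row is one combination; SUMALL says how often it was observed
print(df)
```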
Then we need to expand df by replicating rows according to SUMALL.
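One way this expansion could look, assuming a toy df built from a few of the question's rows (defined inline so the snippet runs on its own). The names idx and df.expanded are hypothetical:

```r
# Toy data (an assumption for illustration): features plus the count column
df <- data.frame(
  PP2V   = c(1, 1, 1, 2, 2, 2),
  ADR    = c(12, 14, 18, 0, 0, 0),
  PP2D   = c(12, 25, 0, 1, 2, 4),
  SUMALL = c(53, 53, 106, 530, 954, 265)
)

# rep() repeats row index i exactly SUMALL[i] times
idx <- rep(seq_len(nrow(df)), times = df$SUMALL)

# Keep only the feature columns; the count is now encoded by repetition
df.expanded <- df[idx, setdiff(names(df), "SUMALL")]
rownames(df.expanded) <- NULL

# The expanded data has one row per original observation
nrow(df.expanded) == sum(df$SUMALL)
```

Note that the expanded data has sum(SUMALL) rows, so this approach assumes the total count is small enough to fit comfortably in memory.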
Then use that with your favorite clustering method.
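For instance, with kmeans() from base R's stats package, a sketch could look like the following. The toy df, the name df.expanded, the choice of two clusters, and the seed are all assumptions for illustration:

```r
# Toy data and expansion (assumptions for illustration)
df <- data.frame(
  PP2V   = c(1, 1, 1, 2, 2, 2),
  ADR    = c(12, 14, 18, 0, 0, 0),
  PP2D   = c(12, 25, 0, 1, 2, 4),
  SUMALL = c(53, 53, 106, 530, 954, 265)
)
df.expanded <- df[rep(seq_len(nrow(df)), times = df$SUMALL),
                  c("PP2V", "ADR", "PP2D")]

# k-means on the expanded rows; k = 2 is an arbitrary choice here
set.seed(42)  # fixed seed so the random starts are reproducible
fit <- kmeans(df.expanded, centers = 2, nstart = 10)

fit$centers  # one row of feature means per cluster
fit$size     # number of expanded rows (i.e. observations) per cluster
```

Because each combination appears SUMALL times in the expanded data, the cluster centers are plain means over the expanded rows, which amounts to SUMALL-weighted means of the original combinations.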