I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.
The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write out the actual cluster results to file. I am using the following absurdly simple script to perform the clustering,
myData <- read.csv("data.csv", sep=",", header=FALSE)
attach(myData)
myBIC <- mclustBIC(myData)
mySummary <- summary( myBIC, data=myData )
at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).
If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!
I want to print out a new file based on the results I believe are encoded in myBIC. Something like,
CLUST x, y, z
1 1.2, 3.4, 5.2
1 1.2, 3.3, 5.2
2 5.5, 1.3, 1.3
3 7.1, 1.2, -1.0
3 7.2, 1.2, -1.1
and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.
Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...
EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic matrix,
> mySummary$classification
[1] 1 1 2 1 3
[6] 1 1 1 3 1
[12] 1 2 1 3 1
[18] 1 3
which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:
> newData <- mySummary$classification
> write.csv( newData, file="class.csv" )
and that the result actually looks pretty nice!
$ head class.csv
"","x"
"1",1
"2",2
"3",2
where the first column apparently matches the index for the input data, and the second column describes the assigned class identity.
The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?
I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...
To calculate the actual clustering parameters themselves (mean, variance, which cluster each point belongs to), you need to use `Mclust`. To do the writing you can use (for example) `write.csv`.
By default `Mclust` calculates the parameters for the optimal model as determined by BIC, so if that's what you want, you can do `myMclust <- Mclust(myData)`.
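A minimal sketch of that call, assuming the mclust package is installed. The simulated `myData` below is only a stand-in so the snippet runs on its own; with real data you would read `data.csv` as in the question.

```r
library(mclust)

# Stand-in data (an assumption for illustration): three 3-D Gaussian blobs.
# With the question's file, use: myData <- read.csv("data.csv", header = FALSE)
set.seed(42)
centers <- rbind(c(0, 0, 0), c(5, 5, 5), c(10, 0, 5))
myData <- as.data.frame(centers[rep(1:3, each = 50), ] +
                        matrix(rnorm(450, sd = 0.5), ncol = 3))
names(myData) <- c("x", "y", "z")

# Fit the candidate models and keep the parameters of the one with the best BIC
myMclust <- Mclust(myData)

# Short overview of the selected model and cluster sizes
print(summary(myMclust))
```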
`myMclust$BIC` will contain the results for all the other models (i.e. `myMclust$BIC` is more or less the same as `mclustBIC(myData)`).
See the Value section of `?Mclust` for the other information `myMclust` holds. For example, `myMclust$parameters$mean` is the mean of each cluster, `myMclust$parameters$variance` the variance of each cluster, and so on. In particular, `myMclust$classification` contains the cluster each point belongs to, calculated for the optimal model.
So, to get the output you want, bind `myMclust$classification` to your data as a `CLUST` column and write the result out with `write.csv`.
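A sketch of this step, under the same assumptions as before (mclust installed, simulated stand-in data). The output file names `clusters.csv`, `means.csv` and `parameters.txt` are hypothetical. The last two calls are one possible way to also save the cluster means and the whole nested `parameters` list, which the question asks about.

```r
library(mclust)

# Stand-in data (an assumption for illustration); replace with your own data frame
set.seed(42)
centers <- rbind(c(0, 0, 0), c(5, 5, 5), c(10, 0, 5))
myData <- as.data.frame(centers[rep(1:3, each = 50), ] +
                        matrix(rnorm(450, sd = 0.5), ncol = 3))
names(myData) <- c("x", "y", "z")

myMclust <- Mclust(myData)

# Prepend each point's cluster label as a CLUST column and write it out
myData.withCluster <- cbind(CLUST = myMclust$classification, myData)
write.csv(myData.withCluster, file = "clusters.csv",
          row.names = FALSE, quote = FALSE)

# parameters$mean holds one column per cluster, so transpose to one row per cluster
write.csv(t(myMclust$parameters$mean), file = "means.csv", quote = FALSE)

# Dump the nested parameters list exactly as the console would print it
capture.output(print(myMclust$parameters), file = "parameters.txt")
```

Note that `read.csv(..., header=FALSE)` names the columns `V1`, `V2`, ... by default, so the `x`, `y`, `z` headings only appear if you set the column names yourself, as done above.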
A note on `write.csv`: if you don't pass `row.names=FALSE` you'll get an extra column in your CSV containing the row number. Also, `quote=FALSE` writes your column headings as `CLUST,x,y,z`, whereas otherwise they'd be `"CLUST","x","y","z"`. It's your choice.
Suppose we wanted to do the same, but with the parameters from a different, non-optimal model. By default, `Mclust` calculates parameters only for the optimal model. To calculate parameters for a particular model (say `"EEI"`), restrict the fit to it with `Mclust(myData, modelNames="EEI")` and then proceed as before.
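A short sketch of fitting one specific model, again assuming mclust is installed and using simulated stand-in data. `modelNames` is the `Mclust` argument that limits which covariance models are considered; the number of clusters is still chosen by BIC within that model.

```r
library(mclust)

# Stand-in data (an assumption for illustration); replace with your own data frame
set.seed(42)
centers <- rbind(c(0, 0, 0), c(5, 5, 5), c(10, 0, 5))
myData <- as.data.frame(centers[rep(1:3, each = 50), ] +
                        matrix(rnorm(450, sd = 0.5), ncol = 3))
names(myData) <- c("x", "y", "z")

# Only the "EEI" model is fitted, so its parameters are the ones returned
myMclust <- Mclust(myData, modelNames = "EEI")

# The classification and parameters are then used exactly as before
myData.withCluster <- cbind(CLUST = myMclust$classification, myData)
print(head(myData.withCluster))
```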