How can I write the clustering results from mclust to file?

I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.

The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write the actual cluster results to a file. I am using the following absurdly simple script to perform the clustering:

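library(mclust)                                # provides mclustBIC
myData <- read.csv("data.csv", header=FALSE)   # sep="," is already the default
myBIC <- mclustBIC(myData)                     # BIC for each candidate model
mySummary <- summary(myBIC, data=myData)       # best model, fitted to the data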

at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).

If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!

I want to print out a new file based on the results I believe are encoded in myBIC. Something like,

CLUST x, y, z
1 1.2, 3.4, 5.2
1 1.2, 3.3, 5.2
2 5.5, 1.3, 1.3
3 7.1, 1.2, -1.0
3 7.2, 1.2, -1.1

and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.

Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...

EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic vector:

> mySummary$classification
 [1] 1 1 2 1 3
 [6] 1 1 1 3 1
[11] 1 2 1 3 1
[16] 1 3

which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:

> newData <- mySummary$classification
> write.csv( newData, file="class.csv" )

and that the result actually looks pretty nice!

 $ head class.csv
"","x"
"1",1
"2",2
"3",2

where the first column apparently matches the index of the input data, and the second column gives the assigned class.

The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?

I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...

Answer by mathematical.coffee (best answer):

To calculate the actual clustering parameters themselves (mean, variance, what cluster each point belongs to), the most direct route is Mclust (the summary() call in your edit exposes the same pieces). To do the writing you can use (for example) write.csv.

By default Mclust calculates the parameters for the optimal model as determined by BIC, so if that's what you want, you can do:

myMclust <- Mclust(myData)

Then myMclust$BIC will contain the BIC results for all the models that were tried (i.e. myMclust$BIC is more-or-less the same as mclustBIC(myData)).

See ?Mclust in the Value: section to see what other information myMclust has. For example, myMclust$parameters$mean is the mean for each cluster, myMclust$parameters$variance the variance for each cluster, ...

myMclust$classification, meanwhile, holds the cluster each point belongs to, calculated for the optimal model.

So, to get the output you want, you can do:

# load the package (provides Mclust and mclustBIC)
library(mclust)
# create some data for example purposes -- you have your read.csv(...) instead.
# (with header=FALSE your columns will be named V1, V2, V3 rather than x, y, z,
#  so either pass col.names=c("x","y","z") to read.csv or adjust the names below)
myData <- data.frame(x=runif(100),y=runif(100),z=runif(100))
# get parameters for the optimal model
myMclust <- Mclust(myData)
# if you wanted to do your summary like before:
mySummary <- summary( myMclust$BIC, data=myData )

# add a column in myData CLUST with the cluster.
myData$CLUST <- myMclust$classification
# now to write it out:
write.csv(myData[,c("CLUST","x","y","z")], # reorder columns to put CLUST first
          file="out.csv",                  # output filename
          row.names=FALSE,                 # don't save the row numbers
          quote=FALSE)                     # don't surround column names in ""

A note on write.csv: if you leave out row.names=FALSE you'll get an extra column in your csv containing the row number. Also, quote=FALSE writes your column headings as CLUST,x,y,z whereas otherwise they'd be "CLUST","x","y","z". It's your choice.
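To see the effect of those two arguments without touching the disk, here is a small base-R sketch. The two-row data frame is made up for illustration, and the variable names are arbitrary. It relies on write.csv printing to the console when no file is given, so capture.output can collect the lines:

```r
# Toy data frame, invented for illustration; not real clustering output.
df <- data.frame(CLUST = c(1L, 2L), x = c(1.2, 5.5))

# With no file argument write.csv() prints to the console,
# so capture.output() can collect the lines as character strings.
with_defaults <- capture.output(write.csv(df))
stripped      <- capture.output(write.csv(df, row.names = FALSE, quote = FALSE))

with_defaults[1]  # header line: "","CLUST","x"
stripped[1]       # header line: CLUST,x
```

The empty "" in the first header is the unnamed row-number column that row.names=FALSE removes.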

Suppose you wanted to do the same, but with the parameters from a different, non-optimal model. By default Mclust calculates parameters only for the optimal model, so to get them for a particular model (say "EEI"), you'd do:

myMclust <- Mclust(myData,modelNames="EEI")

and then proceed as before.
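For the second half of the question (writing out the centroids and the nested parameters object), one possible approach is sketched below in base R. To keep it runnable without a fitted model, `params` is a hand-built stand-in that is only assumed to have the same shape as myMclust$parameters (mixing proportions in `pro`, a dimensions-by-clusters matrix in `mean`, and a nested `variance` list); check ?Mclust for the exact contents in your version. The output filenames are arbitrary.

```r
# Stand-in for myMclust$parameters, built by hand for illustration.
# ASSUMPTION: the real object has a similar nested shape; verify with str().
params <- list(
  pro  = c(0.6, 0.4),
  mean = matrix(c(1.2, 3.4, 5.2,    # cluster 1 centroid
                  5.5, 1.3, 1.3),   # cluster 2 centroid
                nrow = 3, dimnames = list(c("x", "y", "z"), NULL)),
  variance = list(modelName = "EEI", d = 3, G = 2,
                  sigma = array(rep(diag(3), 2), dim = c(3, 3, 2)))
)

# Centroids: transpose so each row is one cluster, then write as csv.
centroids <- data.frame(CLUST = seq_len(ncol(params$mean)), t(params$mean))
write.csv(centroids, file = "centroids.csv", row.names = FALSE, quote = FALSE)

# Whole nested object, exactly as the console would print it.
capture.output(print(params), file = "parameters.txt")

# Alternatively, walk every leaf of the nested list and flatten it to
# one "name: values" line per element.
flat <- rapply(params, function(x) paste(x, collapse = " "), how = "unlist")
writeLines(paste0(names(flat), ": ", flat), "parameters_flat.txt")
```

With a real fit you would replace `params` with myMclust$parameters. capture.output is the least effort if the file only needs to be human-readable; the rapply version is easier to parse later because each nested element ends up on a single line named by its path (for example variance.sigma).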