Clustering with Mclust results in an empty cluster


I am trying to cluster my empirical data using Mclust. When using the following, very simple code:

library(reshape2)
library(mclust)

data <- read.csv(file.choose(), header=TRUE,  check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)

fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)

R gives me the following result:

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust E (univariate, equal variance) model with 4 components: 

log-likelihood    n df       BIC       ICL
  -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
1    2    3    4 
0 2271  896   91 

Mixing probabilities:
    1         2         3         4 
0.2807685 0.4342499 0.2544305 0.0305511 

Means:
   1        2        3        4 
1381.391 1381.715 1574.335 1851.667 

Variances:
   1        2        3        4 
7466.189 7466.189 7466.189 7466.189 

Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html

I do not understand why Mclust gives me an empty cluster (0 observations), especially one whose mean is nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.

This thread, "Cluster contains no observations", has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?

I am somewhat clueless as to where my problem is or if I am missing anything obvious. Any help is appreciated!

StupidWolf On BEST ANSWER

As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that a lot of the data lies there (see the spike in the histogram):

library(mclust)
set.seed(111)
data <- read.csv("example.csv", header=TRUE,  check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
# Histogram of the values, with the fitted component means as vertical lines
hist(data$value, breaks=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)

[Figure: histogram of data$value with the fitted component means marked as vertical lines]

Briefly, mclust (like any GMM) fits a probabilistic model: it estimates the mean and variance of each cluster, and also the probability that each point belongs to each cluster. This is unlike k-means, which provides a hard assignment. The likelihood of the model is built from the density of every data point, where each point's density is the sum of the cluster densities weighted by the mixing proportions. You can check the details in mclust's publication.

In this model, the means of clusters 1 and 2 are close, but their expected proportions are different:
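To make the likelihood concrete, here is a minimal base-R sketch of the log-likelihood of a univariate Gaussian mixture with one shared variance (the "E" model). The function name `mixture_loglik` and the five toy data values are made up for illustration; the parameters are the ones printed in the question's summary. This is not how mclust computes it internally, only the textbook formula.

```r
# Log-likelihood of a univariate Gaussian mixture with a shared variance.
# mixture_loglik is a hypothetical helper, not part of mclust.
mixture_loglik <- function(x, pro, mean, variance) {
  sd <- sqrt(variance)
  # dens[i, k] = pro[k] * N(x[i] | mean[k], variance)
  dens <- sapply(seq_along(pro), function(k) pro[k] * dnorm(x, mean[k], sd))
  dens <- matrix(dens, nrow = length(x))
  # Sum over components per point, then sum the logs over points
  sum(log(rowSums(dens)))
}

# Parameters copied from the summary output in the question
pro <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)
mu  <- c(1381.391, 1381.715, 1574.335, 1851.667)
v   <- 7466.189

# Toy data points (assumed values, not the real data set)
x <- c(1300, 1380, 1400, 1600, 1850)
mixture_loglik(x, pro, mu, v)
```

Note that every point contributes through all four components at once, so a component can carry weight in the likelihood without ever being the most probable one for any single point.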

fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627

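This means that a data point around the means of clusters 1 or 2 will consistently be assigned to cluster 2: with equal variances and nearly identical means, the component with the larger mixing proportion always has the higher posterior probability. For example, let's predict data points from 1350 to 1400: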

head(predict(fit,1350:1400)$z)
             1         2          3            4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09

The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:

 head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
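The same effect can be reproduced by hand without mclust. The sketch below computes the posterior membership probabilities from the parameters printed in the question's summary (the rerun in this answer gives slightly different proportions, so the numbers will not match the `predict` output exactly). The helper name `posterior` and the grid 1000:2000 are assumptions chosen for illustration.

```r
# Parameters copied from the summary output in the question
pro   <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)
mu    <- c(1381.391, 1381.715, 1574.335, 1851.667)
sigma <- sqrt(7466.189)

# posterior is a hypothetical helper: z[i, k] = P(component k | x[i])
posterior <- function(x, pro, mu, sigma) {
  weighted <- sapply(seq_along(pro), function(k) pro[k] * dnorm(x, mu[k], sigma))
  weighted <- matrix(weighted, nrow = length(x))
  weighted / rowSums(weighted)
}

# Assumed grid of values, roughly covering the fitted means
x   <- 1000:2000
z   <- posterior(x, pro, mu, sigma)

# Hard assignment: the column with the highest posterior in each row
cls <- apply(z, 1, which.max)

# Count how many grid points land in each of the 4 components
table(factor(cls, levels = 1:4))
```

On this grid component 1 never wins the argmax, because component 2 has almost the same mean, the same variance, and a larger mixing proportion.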

To answer your question: no, you did not do anything wrong. It's a drawback, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that have members.

If you use modelNames="V", I see that the solution is equally problematic:

fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
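One way to keep only the occupied clusters is sketched below in base R. To stay self-contained it rebuilds a classification vector from the counts in the question's clustering table instead of using the real fit; with an actual fit you would presumably use `fit$classification` and `fit$G` in place of the hypothetical `classification` and `G`.

```r
# Stand-in for fit$classification, rebuilt from the clustering table
# in the question (0 / 2271 / 896 / 91 observations)
classification <- c(rep(2, 2271), rep(3, 896), rep(4, 91))
G <- 4  # stand-in for fit$G

# Tabulate over all G labels so that empty clusters show up as 0
counts <- table(factor(classification, levels = seq_len(G)))

# Labels of clusters that actually received observations
occupied <- as.integer(names(counts)[counts > 0])

# Optional: renumber the occupied clusters consecutively as 1, 2, 3, ...
relabeled <- match(classification, occupied)
table(relabeled)
```

This only changes the labels; the fitted mixture still has four components, so density estimates from the model are unaffected.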

[Figure: classification plot of the modelNames="V" fit]

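Using scikit-learn's GMM, I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variances, consider using a fuzzy k-means instead:

library(ClusterR)
# Assumption: the fitting call is missing from the post; 3 clusters is inferred from the axis labels 1:3 below
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL,xlim=range(data$value),ylim=c(0,4),ylab="cluster",yaxt="n",xlab="values")
points(data$value,fit_kmeans$clusters,pch=19,cex=0.1,col=factor(fit_kmeans$clusters))
axis(2,1:3,as.character(1:3))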

[Figure: data values plotted against their k-means cluster assignment]

If you don't need equal variance, you can use the GMM function in the ClusterR package too.