Clustering with Mclust results in an empty cluster


I am trying to cluster my empirical data using Mclust. When using the following very simple code:

library(reshape2)
library(mclust)

data <- read.csv(file.choose(), header=TRUE,  check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)

fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)

R gives me the following result:

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust E (univariate, equal variance) model with 4 components: 

 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
   1    2    3    4 
   0 2271  896   91 

Mixing probabilities:
        1         2         3         4 
0.2807685 0.4342499 0.2544305 0.0305511 

Means:
       1        2        3        4 
1381.391 1381.715 1574.335 1851.667 

Variances:
       1        2        3        4 
7466.189 7466.189 7466.189 7466.189 

Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html

I do not readily understand why Mclust gives me an empty cluster (0 observations), especially one with a mean nearly identical to that of the second cluster. This only happens when I specifically ask for a univariate, equal-variance model; using, for example, modelNames="V" or leaving the default does not produce this problem.

This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that one appeared to be due to randomly generated data?

I am somewhat clueless as to where my problem is or if I am missing anything obvious. Any help is appreciated!

1 Answer

StupidWolf (accepted answer):

As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike on the histogram):

set.seed(111)
library(mclust)

data <- read.csv("example.csv", header = TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames = "E", G = 1:7)

hist(data$value, br = 50)
abline(v = fit$parameters$mean,
       col = c("#FF000080", "#0000FF80", "#BEBEBE80", "#BEBEBE80"), lty = 8)

[histogram of data$value with the four fitted cluster means drawn as dashed vertical lines]

Briefly, mclust fits a Gaussian mixture model (GMM): a probabilistic model that estimates the mean and variance of each cluster, as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The log-likelihood of the model is the sum, over all data points, of the log of the mixture density, i.e. each component's density weighted by its mixing proportion; you can check the details in mclust's publication.
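
To make this concrete, here is a minimal sketch (assuming the fit and data objects from above) that recomputes the log-likelihood by hand from the mixing proportions, the means, and the shared variance:

p <- fit$parameters
# mixture density at each point: component densities weighted by the mixing
# proportions; the log-likelihood sums the log of this over all points
dens <- sapply(seq_along(p$pro), function(k)
  p$pro[k] * dnorm(data$value, mean = p$mean[k], sd = sqrt(p$variance$sigmasq)))
sum(log(rowSums(dens)))  # should agree with fit$loglik (about -20504.71)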

In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions (mixing probabilities) differ:

fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627

This means that if a data point lies near the shared mean of components 1 and 2, it will consistently be assigned to cluster 2, because that component's larger mixing proportion gives it the higher posterior probability. For example, let's predict data points from 1350 to 1400:

head(predict(fit,1350:1400)$z)
             1         2          3            4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09

The $classification is obtained by taking, for each row, the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:

 head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
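
For illustration, base R's max.col reproduces this argmax over the posterior matrix z; a small sketch, again assuming fit from above:

z <- predict(fit, 1350:1400)$z
head(max.col(z))                                           # 2 2 2 2 2 2
all(max.col(z) == predict(fit, 1350:1400)$classification)  # TRUE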

To answer your question: no, you did not do anything wrong; it is a limitation, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can basically keep only the clusters that actually have members.
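
For example, a small sketch (assuming fit from above) of keeping only the populated components:

# components that actually received observations
populated <- sort(unique(fit$classification))
populated                        # here: 2 3 4 -- component 1 stays empty
fit$parameters$mean[populated]   # means of the non-empty clusters
fit$parameters$pro[populated]    # and their mixing proportions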

If you use modelNames="V", I see the solution is equally problematic:

fitv <- Mclust(data$value, modelNames = "V", G = 1:7)
plot(fitv, what = "classification")

[classification plot of the modelNames = "V" fit]

Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variances, consider a fuzzy k-means instead, for example ClusterR's Cluster_Medoids with fuzzy = TRUE:

library(ClusterR)
# the fit_kmeans call was cut off here; Cluster_Medoids(fuzzy = TRUE) is one plausible reconstruction
fit_kmeans <- Cluster_Medoids(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))

[plot of data$value against the fuzzy k-means cluster assignments]

If you don't need equal variance, you can use the GMM function in the ClusterR package too.
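
A rough sketch of that route (an untested outline; gaussian_comps = 3 is an assumption, and the argument names follow ClusterR's documented interface):

library(ClusterR)

# fit a 3-component GMM with per-component variances
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3,
               dist_mode = "eucl_dist", seed_mode = "random_subset",
               km_iter = 10, em_iter = 10)

# soft assignments and hard labels from the fitted parameters
pred <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                    gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels)  # cluster sizes (label indexing may differ by version)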