I was performing Gaussian mixture modeling to find clusters within my dataset. First I used sklearn's GaussianMixture class, and BIC suggested that 3 components was the optimal number.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as GMM
from sklearn.preprocessing import StandardScaler

# standardize the features before fitting
scaler = StandardScaler()
scd = scaler.fit_transform(data)

# fit models with 1 to 20 components and compare by BIC
n_components = np.arange(1, 21)
models = [GMM(n, covariance_type='full', random_state=42).fit(scd)
          for n in n_components]
plt.plot(n_components, [m.bic(scd) for m in models], label='BIC')
plt.legend()
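To be concrete about how I read off the optimum, here is a minimal, self-contained sketch of the same selection step (synthetic two-blob data stands in for my `scd`, so the chosen count here is just illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# two well-separated blobs, so BIC should prefer 2 components
X = np.vstack([rng.normal(-5, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

n_components = np.arange(1, 7)
bics = [GaussianMixture(n, covariance_type='full', random_state=42)
        .fit(X).bic(X) for n in n_components]

# the BIC-optimal number of components is the argmin
best_n = int(n_components[np.argmin(bics)])
print(best_n)
```

In my real run this argmin landed on 3 components.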
Then I wanted to double check this in R as follows:
library(factoextra)
library(mclust)

# z-score normalization, ignoring NAs
normFunc <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

set.seed(123)
clust <- data.frame(sapply(df, normFunc))
BIC <- Mclust(clust, G = 1:20)
summary(BIC)
However, in R, mclust chose 13 clusters with a VEE model. Where does the difference come from? I tried setting min_covar=0.0000001 as stated in this answer, but that parameter belonged to the old, deprecated GMM class and is no longer supported (the closest equivalent in the current GaussianMixture API appears to be reg_covar). My guess is that the discrepancy comes from either minor scaling differences or the covariance_type. Is it recommended to tune covariance_type in sklearn? And is covariance_type='full' the closest analogue of mclust's VVV model?
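For reference, this is roughly how I would compare covariance_type settings by BIC in sklearn, if tuning it is the right move (a minimal sketch on synthetic data; the fixed n_components=2 and the blob layout are just assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two isotropic blobs; real data would be the scaled dataset instead
X = np.vstack([rng.normal(-4, 1, (150, 2)),
               rng.normal(4, 1, (150, 2))])

# fit one model per covariance structure and record its BIC
results = {}
for cov in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=2, covariance_type=cov,
                          random_state=0).fit(X)
    results[cov] = gmm.bic(X)

best = min(results, key=results.get)  # lowest BIC wins
print(best, results)
```

Since BIC penalizes parameter count, the more constrained types (diag, spherical) can win on data where a full covariance buys little extra fit, which is why I suspect the covariance structure matters for the sklearn-vs-mclust gap.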