I have a data set with 27211 samples and 90 attributes. This data set has no class labels. I want to fit a Gaussian mixture model to the data set, but I don't know how to measure its performance. Can you help me?
import numpy as np
import pandas as pd
from sklearn import mixture

# TRAIN_PATH_NAME and j come from the surrounding code (a loop over training files)
trainFile = TRAIN_PATH_NAME + "train" + str(j+1) + ".txt"
trainData = pd.read_csv(trainFile, sep=",", header=None)

np.random.seed(42)
# mixture.GMM was removed from scikit-learn; GaussianMixture is the current class
g = mixture.GaussianMixture(n_components=60)
g.fit(trainData.values)
print("IS_CONVERGED: ", g.converged_)

# GaussianMixture.sample() returns a tuple (samples, component_labels)
sampled, _ = g.sample(trainData.values.shape[0])
return sampled  # this snippet sits inside a helper function in the original code
Since you do not have ground truth (labels), you cannot give a definitive estimate of the performance and have to rely on a metric of your choice. Assessing the quality of clusters without labels is a quite common problem, so there is plenty of documentation around:
There are several options to measure performance in this unsupervised setting. For GMMs, which are fitted by maximizing a likelihood, the most common criteria are BIC and AIC. They are available directly on scikit-learn's GaussianMixture class via the bic() and aic() methods.
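For example, one common way to use these criteria is to fit mixtures with different numbers of components and keep the one with the lowest BIC (or AIC). A minimal sketch, assuming your data is in trainData as in your snippet and that the candidate range of component counts is something you would tune yourself:

import numpy as np
from sklearn.mixture import GaussianMixture

X = trainData.values  # your (27211, 90) data matrix

# Fit GMMs with different numbers of components and keep the lowest-BIC model
# (AIC works the same way via gmm.aic(X)).
best_bic, best_gmm = np.inf, None
for k in range(10, 101, 10):  # hypothetical candidate counts, adjust to your data
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bic = gmm.bic(X)
    print("n_components=%d: BIC=%.1f, AIC=%.1f" % (k, bic, gmm.aic(X)))
    if bic < best_bic:
        best_bic, best_gmm = bic, gmm

print("Best number of components by BIC:", best_gmm.n_components)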
But there are many more metrics to measure the performance of clusterings in general. They are well described in the scikit-learn documentation. I find the silhouette score fairly intuitive.
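As a rough sketch of how you could apply it to a fitted mixture (reusing best_gmm and X from the sketch above): hard-assign each sample to its most likely component and score that partition. Since the silhouette computation is quadratic in the number of samples, you can pass sample_size to score a random subset instead of all 27211 points.

from sklearn.metrics import silhouette_score

# Treat the most likely mixture component of each sample as its cluster label;
# values closer to 1 indicate better-separated clusters.
labels = best_gmm.predict(X)
score = silhouette_score(X, labels, sample_size=10000, random_state=42)
print("Silhouette score:", score)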