I'm using Python to cluster a 5D data set, and each run generates a different set of clusters. I'm curious as to why this is.
Here's the code:
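from collections import Counter
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestCentroid

df = pd.read_csv('database.csv')
# Drop the identifier and label columns, leaving the 5 feature columns
ratios = df.drop(['patient', 'class'], axis=1)
gaussian = GaussianMixture(n_components=7).fit(ratios).predict(ratios)
df['gaussian'] = gaussian
cluster_counts = Counter(df['gaussian'])
centroids = NearestCentroid().fit(ratios, gaussian).centroids_
sum_of_distances = np.zeros((len(centroids), 5))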
Here's a graph showing the sum of the average distances to the centroid for one run:
And here's a graph for another run:
You can see that the bar for the Gaussian mixture varies from one run to the next, whereas the bars for the other clustering algorithms do not change.
If someone could explain why this happens, it would be much appreciated.
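See the GaussianMixture documentation. You are interested in the random_state parameter. Each time you run the model, the initialization of the parameters may differ. GaussianMixture is fitted with EM from a randomly seeded starting point (k-means by default), and EM only converges to a local optimum, so different starting points can produce different clusters. Passing a fixed integer as random_state makes the runs reproducible. More about random and seed in Python: random.seed(): What does it do?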
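As a rough illustration of the random_state suggestion, here is a minimal sketch. It assumes synthetic 5D data in place of the asker's database.csv, and the helper name `cluster` and the seed value 42 are made up for this example rather than taken from the question.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the 5D `ratios` table (assumption: the real
# database.csv is not available, so we draw 7 noisy blobs instead).
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(7, 5))
ratios = centers[rng.integers(0, 7, size=350)] + rng.normal(size=(350, 5))


def cluster(data, seed):
    """Fit a 7-component Gaussian mixture with a fixed seed; return labels."""
    model = GaussianMixture(n_components=7, random_state=seed)
    return model.fit_predict(data)


# Two independent fits with the same seed on the same data.
labels_a = cluster(ratios, seed=42)
labels_b = cluster(ratios, seed=42)

# With the seed fixed, both runs should assign every point identically.
same = np.array_equal(labels_a, labels_b)
print("identical labels:", same)
```

In the question's code, the equivalent change would be to construct the model as GaussianMixture(n_components=7, random_state=42), with any fixed integer in place of 42. Note that a fixed seed only makes the result repeatable; it does not make that particular clustering better than the ones other seeds produce.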