Yesterday I implemented a GMM (Gaussian Mixture Model) using expectation-maximization algorithm.
As you remember, it models some uknown distribution as a mixture of gaussians which we need to learn its means and variances, and also the weights for each gaussian.
this is the mathematics behind the code (its not that complicated) http://mccormickml.com/2014/08/04/gaussian-mixture-models-tutorial-and-matlab-code/
this is my code:
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
#reference for this code is http://mccormickml.com/2014/08/04/gaussian-mixture-models-tutorial-and-matlab-code/
def expectation(data, means, covs, priors): #E-step. returns the updated probabilities
m = data.shape[0] #gets the data, means covariances and priors of all clusters
numOfClusters = priors.shape[0]
probabilities = np.zeros((m, numOfClusters))
for i in range(0, m):
for j in range(0, numOfClusters):
sum = 0
for l in range(0, numOfClusters):
sum += normalPDF(data[i, :], means[l], covs[l]) * priors[l, 0]
probabilities[i, j] = normalPDF(data[i, :], means[j], covs[j]) * priors[j, 0] / sum
return probabilities
def maximization(data, probabilities): #M-step. this updates the means, covariances, and priors of all clusters
m, n = data.shape
numOfClusters = probabilities.shape[1]
means = np.zeros((numOfClusters, n))
covs = np.zeros((numOfClusters, n, n))
priors = np.zeros((numOfClusters, 1))
for i in range(0, numOfClusters):
priors[i, 0] = np.sum(probabilities[:, i]) / m #update priors
for j in range(0, m): #update means
means[i] += probabilities[j, i] * data[j, :]
vec = np.reshape(data[j, :] - means[i, :], (n, 1))
covs[i] += probabilities[j, i] * np.dot(vec, vec.T) #update covs
means[i] /= np.sum(probabilities[:, i])
covs[i] /= np.sum(probabilities[:, i])
return [means, covs, priors]
def normalPDF(x, mean, covariance): #this is simply multivariate normal pdf
n = len(x)
mean = np.reshape(mean, (n, ))
x = np.reshape(x, (n, ))
var = multivariate_normal(mean=mean, cov=covariance,)
return var.pdf(x)
def initClusters(numOfClusters, data): #initialize all the gaussian clusters (means, covariances, priors
m, n = data.shape
means = np.zeros((numOfClusters, n))
covs = np.zeros((numOfClusters, n, n))
priors = np.zeros((numOfClusters, 1))
initialCovariance = np.cov(data.T)
for i in range(0, numOfClusters):
means[i] = np.random.rand(n) #the initial mean for each gaussian is chosen randomly
covs[i] = initialCovariance #the initial covariance of each cluster is the covariance of the data
priors[i, 0] = 1.0 / numOfClusters #the initial priors are uniformly distributed.
return [means, covs, priors]
def logLikelihood(data, probabilities): #data is our data. probabilities[i, j] = k means probability example i belongs in cluster j is 0 < k < 1
m = data.shape[0] #num of examples
examplesByCluster = np.zeros((m, 1))
for i in range(0, m):
examplesByCluster[i, 0] = np.argmax(probabilities[i, :])
examplesByCluster = examplesByCluster.astype(int) #examplesByCluster[i] = j means that example i belongs in cluster j
result = 0
for i in range(0, m):
result += np.log(probabilities[i, examplesByCluster[i, 0]]) #example i belongs in cluster examplesByCluster[i, 0]
return result
m = 2000 #num of training examples
n = 8 #num of features for each example
data = np.random.rand(m, n)
numOfClusters = 2 #num of gaussians
numIter = 30 #num of iterations of EM
cost = np.zeros((numIter, 1))
[means, covs, priors] = initClusters(numOfClusters, data)
for i in range(0, numIter):
probabilities = expectation(data, means, covs, priors)
[means, covs, priors] = maximization(data, probabilities)
cost[i, 0] = logLikelihood(data, probabilities)
plt.plot(cost)
plt.show()
the problem is that the loglikelihood is behaving strange. I expect it to be monotonic increasing. But it's not.
For example, with 2000 examples of 8 features with 3 gaussian clusters, the loglikelihood looks like this (30 iterations) -
So this is very bad. But on other tests I ran, for example one test with 15 examples of 2 features and 2 clusters, the loglikelihood is this -
Better but still not perfect.
Why does this happen and how can I fix it?
The problem lies in the maximization step.
The code uses
means
for calculatingcovs
. However this is done in the same loop, before dividingmeans
by the sum of probabilities.This causes the estimated covariances to explode.
Here's a suggested fix:
And the resulting cost function (200 data points, 4 features):
EDIT: I was convinced this bug was the only issue in the code, however running a few additional examples, I still sometimes see non-monotonic behaviour (although less erratic than before). So this seems to be just part of the problem.
EDIT2: There was another issue in covariance calculation: the vector multiplication should be element-wise, not dot-product - keep in mind that the result should be a vector. Results seem to be consistently-monotonically-increasing now.