In a typical clustering problem, the probability of a data point x is p(x) = sum_k p(k)p(x|k)
, where k
is a latent variable specifying the cluster that x belongs to. We can use EM algorithm to maximize the log likelihood of the objective function for the training data set: sum_n log (sum_k p(k)(p(x|k)))
.
I wonder if EM algorithm can solve the problem with two sets of latent variables, i.e.
p(x) = sum_k sum_l p(x|k, l)p(k)p(l)
? If so, how can we do that?
What if all of the probability distributions are sigmoid functions?
This should be just the straightforward application of the EM algorithm as a way of solving hidden data problems - the hidden data is the underlying value of k and l at each step. In the E step you work out the expected log likelihood, considering each possible value of the pair (k,l), using the probability of this, given the data and the current parameter settings as a weight. In the M state you find the parameters that maximise this expected log likelihood. This is very similar to just encoding the pair (k,l) as a single index, m, except that there is more structure in p(k)p(l) than there is in p(m), which will affect the M step very slightly.
If the probabilities are sigmoid - any any other probability distribution - the justification of the EM algorithm still holds: that each step increases or leaves unchanged the log likelihood. However you may find that the M-step becomes more expensive if the optimisation problem gets harder.