Clustering vs fitting a mixture model


I have a question about using a clustering method vs fitting the same data with a distribution.

Assume I have a dataset with 2 features (feat_A and feat_B), and that I use a clustering algorithm to divide the data into an optimal number of clusters, say 3.

My goal is to assign to each input point [feat_Ai, feat_Bi] a probability (or something similar) that it belongs to cluster 1, 2, or 3.

a. First approach with clustering:

I cluster the data into 3 clusters and assign to each point a probability of belonging to each cluster, based on its distance from that cluster's center.
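To make approach a. concrete, here is a minimal sketch of one way to turn distances into something probability-like: a softmax over negative squared distances. This is only an assumption on my part; k-means itself does not define probabilities, and the `temperature` parameter, the function name `soft_assign` and the center values are all hypothetical. The features are assumed to be standardized, since duration and step count live on very different scales.

```python
import math

def soft_assign(point, centers, temperature=1.0):
    """Map distances to cluster centers onto pseudo-probabilities.

    Uses a softmax of the negative squared Euclidean distance.
    `temperature` is a free (hypothetical) knob: larger values give
    softer, more uniform assignments.
    """
    sq_dists = [
        sum((p - c) ** 2 for p, c in zip(point, center))
        for center in centers
    ]
    # Shift by the smallest distance so the exponentials cannot underflow to all zeros.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / temperature) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical centers in standardized (feat_A, feat_B) space,
# e.g. as found by k-means on the subsample.
centers = [(-1.0, -1.0), (0.0, 0.0), (1.5, 1.2)]

probs = soft_assign((0.1, -0.2), centers)
nearest = probs.index(max(probs))  # index of the most likely cluster
```

The values always sum to 1, but they are scores rather than calibrated probabilities: they depend entirely on the chosen `temperature` and ignore how spread out each cluster is.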

b. Second approach using mixture model:

I fit a mixture model (mixture distribution) to the data. The model is fit with the expectation-maximization (EM) algorithm, which assigns each observation a posterior probability for each component density. Each point is then assigned to the component with the highest posterior probability.
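For approach b., once the model is fitted, the posterior for any point follows from Bayes' rule: p(k | x) is proportional to weight_k times the density of component k at x. The sketch below assumes Gaussian components with diagonal covariance (my assumption, not a requirement of mixture models) and uses made-up parameter values in place of the ones EM would return.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Posterior probability p(k | x) of each mixture component k."""
    log_joint = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Normalize in log space to avoid underflow for distant points.
    top = max(log_joint)
    unnorm = [math.exp(lj - top) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical fitted parameters (stand-ins for the output of EM).
weights = [0.5, 0.3, 0.2]
means = [(-1.0, -1.0), (0.0, 0.0), (1.5, 1.2)]
variances = [(0.3, 0.3), (0.2, 0.2), (0.5, 0.5)]

post = posteriors((0.1, -0.2), weights, means, variances)
label = post.index(max(post))  # hard assignment: component with highest posterior
```

Unlike the distance-based score, these posteriors account for each component's weight and spread. As long as the fitted parameters are kept fixed, the same function can be applied unchanged to points that were not in the subsample.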


In my problem I find the cluster centers (or fit the model, if approach b. is used) on a subsample of the data. I then have to assign a probability to a large amount of other data. I would like to know which approach is better for new data, so that the assignments remain meaningful.

I would go for a clustering method, for example k-means, because:

  1. If the new data come from a distribution different from the one used to fit the mixture model, the assignments may be incorrect.

  2. With new data the posterior probability changes.

  3. The clustering method minimizes the within-cluster variance in order to find a kind of optimal separation boundary, whereas the mixture model takes the variance of the data into account when building the model (I am not sure that the resulting clusters are separated in an optimal way).
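Regarding point 1, a possible check (a sketch only, under the same hypothetical diagonal-Gaussian parameters as an illustration) is to look at the mixture log-density of each new point. Posteriors always sum to 1, so they cannot reveal that a point lies far from everything the model was fitted on, whereas a very low log-density can. What counts as "too low" is not specified here and would have to be chosen from the data.

```python
import math

def mixture_log_density(x, weights, means, variances):
    """Log of the mixture density sum_k w_k * N(x | mean_k, diag(var_k))."""
    log_terms = []
    for w, mean, var in zip(weights, means, variances):
        log_pdf = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp: stable even when every term is very negative.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

# Hypothetical fitted parameters, for illustration only.
weights = [0.5, 0.3, 0.2]
means = [(-1.0, -1.0), (0.0, 0.0), (1.5, 1.2)]
variances = [(0.3, 0.3), (0.2, 0.2), (0.5, 0.5)]

typical = mixture_log_density((0.1, -0.2), weights, means, variances)
unusual = mixture_log_density((8.0, -6.0), weights, means, variances)
# `unusual` is far lower than `typical`, flagging a point the model has not seen the like of.
```

A distance-based score from k-means has the same blind spot: it still assigns a far-away point to its nearest center, so comparing the raw distance against the distances seen in the subsample would play the analogous role there.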

More info about the data:

The features should not be assumed to be dependent. Feat_A represents the duration of a physical activity and Feat_B the step count. In principle, the step count increases with the duration of the activity, but this is not always true.

Please help me think this through, and if you have any other points, please let me know.
