Find partial membership with KMeans clustering algorithm

638 views Asked by At

I can calculate cluster membership with KMeans pretty easily:

open System
open System.IO
open Utils
open Accord
open Accord.Math 
open Accord.MachineLearning

let vals = [|
    [|1.0; 2.0; 3.0; 2.0|]
    [|1.1; 1.9; 3.1; 4.0|]
    [|2.0; 3.0; 4.0; 4.0|]    
    [|3.0; 3.1; 2.0; 3.0|]
    [|2.0; 4.0; 3.0; 6.0|]
    [|1.0; 5.0; 5.0; 7.0|]
    [|4.0; 3.0; 6.0; 8.0|]
    [|5.0; 4.0; 3.0; 6.0|]
    [|6.0; 4.0; 8.0; 7.0|]
    [|5.0; 6.0; 5.0; 9.0|]
    [|4.0; 2.0; 7.0; 8.0|]
    [|8.0; 9.0; 3.1; 2.2|]
    [|8.0; 9.0; 2.0; 2.0|]
    [|10.0; 2.0; 3.0; 2.0|]
    [|10.1; 1.9; 3.1; 4.0|]
    [|20.0; 3.0; 4.0; 4.0|]
    [|22.0; 7.0; 2.0; 3.0|]
    [|21.0; 4.0; 3.0; 6.0|]
|]

let kmeans = new KMeans 5
let clusterModel = kmeans.Learn vals
let clusters = clusterModel.Decide vals

Can I calculate partial membership with the standard KMeans algorithm? A coworker suggested using the mean and variances of the cluster members to determine proportional membership and today I've been looking into fuzzy sets and their implementations for F#. For example, here is some documentation for the Accord.net implementation for fuzzy sets. I can translate/run the example for F# but at first glance, I don't see an easy way to get the data from my Kmeans run above to fit the format of assigning partial membership.

Questions:

  1. How would I use the mean/variance of cluster members to calculate partial membership?

  2. Is there an easy way to calculate partial membership with KMeans clustering with the Accord.net library?

  3. The KMeans algorithm in Accord.net is simple to implement; should I spend some time trying to learn this method of clustering/membership to suit my problem rather than try and force Kmeans clustering to suit my needs?

2

There are 2 answers

0
Anton Schwaighofer On BEST ANSWER

As mentioned by Tomas, Accord.NET already gives you many of the building blocks. In particular, calling clusterModel.Scores gives you the (negative) distances to the cluster centroids, see source code

From the negative distances, you can compute an approximate class membership score by taking exponentials, similar to what you would do to compute a Gaussian PDF. In F#, that would look like:

// Scores returns the negative distances between each point
// and the cluster centroid
let negDistances = clusterModel.Scores vals
// Compute an estimated cluster assigment score
let clusterMembership =
    negDistances
    |> Array.map (fun distances ->
        // Take the Exponential of the (negative) distances,
        // as in computing a Gaussian pdf
        let expDist = distances |> Array.map Math.Exp
        let total = Array.sum expDist
        expDist
        |> Array.map (fun d -> d/total)
    )

There are a couple of caveats here:

  • Standard KMeans in Accord uses Euclidean distances, meaning that each direction carries the same weight. Depending on the nature of your data, this may not lead to reasonable results (picture 2 clusters, each shaped like a long cigar)
  • The above class membership calculcation is not taking cluster covariance into account either. To be closer to truth, you would have to compute Bhattacharyya distance, exponentiate, and then scale by inverse det of the covariance matrix. This will fail for singleton clusters.

Regarding your third question: I would not re-implement. It may seem easy initially, but there are usually plenty of corner cases and stability issues that you only run into after some time.

0
Tomas Petricek On

You should be able to use Accord.NET to get the "centroids" of the clusters that the K-means algorithm finds. Those are essentially the centres of the individual clusters. You should then be able to calculate the distance between your new data point and each of the centroids to see which of the centroids are close to your point. (The Decide method returns just the first one.)

I have not tried this, but it seems that KMeans exposes Clusters, which is a KMeansClusterCollection and has the Centroids property (see the docs). It also exposes the Distance property which returns the function for calculating distance between the data points.

Using these, you should be able to compare the distance of your data point with the centroids of all the clusters and decide how close the point is to individual clusters.

Implementing k-means from scratch is not that hard (there is a nice post from Mathias Brandewinder on this), but it seems that Accord.NET exposes all the information that you need in this particular case - so perhaps that's all you need (getting all the details right in custom implementation is always the hardest part...).