Clustering based on Pearson correlation

I have a use case where I have traffic data sampled every 15 minutes for one month. This data is collected for various resources in a network.

Now I need to group resources that are similar (based on their traffic usage pattern from 00:00 to 23:45).

One way to check whether two resources have similar traffic behavior is to compute the Pearson correlation coefficient for every pair of resources and build an N*N matrix.

My question is: which method should I apply to cluster the similar resources? Existing k-means clustering methods are based on Euclidean distance. Which algorithm can I use to cluster based on similarity of pattern?

Any thoughts or links to possible solutions are welcome. I want to implement this in Java.

There is 1 answer

Has QUIT--Anony-Mousse (best answer)

Pearson correlation is not compatible with the mean. Thus, k-means must not be used: it minimizes squared deviations from the cluster means, which is proper for least-squares, but not for correlation.

Instead, just use hierarchical agglomerative clustering (HAC), which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can set a threshold: an absolute correlation of, e.g., +0.75 may be a desirable cutoff for epsilon (if you define the distance as 1 - r, that corresponds to epsilon = 0.25). But to get a feeling for your distance function, dendrograms as used by HAC are probably easier.
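Since the question asks for Java, here is a minimal sketch of the HAC idea on a precomputed correlation matrix. It is only an illustration under some assumptions: the class and method names are made up, the distance is taken as 1 - r (so anti-correlated patterns count as far apart; 1 - |r| would be another choice), average linkage is used, and the search for the closest pair is the naive one, which is fine for small N but slow for large N.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class CorrelationClustering {

    /**
     * Naive average-linkage agglomerative clustering.
     * corr is a symmetric N x N Pearson correlation matrix.
     * Distance between resources i and j is assumed to be 1 - corr[i][j].
     * Merging stops once the closest pair of clusters is farther apart
     * than maxDistance (e.g. 0.25 for a correlation cutoff of 0.75).
     */
    static List<List<Integer>> cluster(double[][] corr, double maxDistance) {
        // Start with every resource in its own cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < corr.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }

        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = Double.POSITIVE_INFINITY;
            // Find the pair of clusters with the smallest average distance.
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = averageDistance(corr, clusters.get(a), clusters.get(b));
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // Stop if nothing is comparable (all NaN) or the closest pair is too far apart.
            if (bestA < 0 || best > maxDistance) {
                break;
            }
            // bestB > bestA, so removing bestB does not shift bestA.
            List<Integer> merged = clusters.remove(bestB);
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }

    // Mean of 1 - r over all pairs (i in a, j in b).
    private static double averageDistance(double[][] corr, List<Integer> a, List<Integer> b) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += 1.0 - corr[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }
}
```

Calling `CorrelationClustering.cluster(corr, 0.25)` on a 4 x 4 matrix where resources 0 and 1 correlate at 0.9, resources 2 and 3 at 0.8, and everything else at 0.1 yields the two groups `[0, 1]` and `[2, 3]`. Recording the merge distances instead of stopping at a threshold would give you the data for a dendrogram.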

Beware that Pearson is not defined for constant patterns. If you have a resource with 0 usage, your distance will be undefined.
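One way to handle this is to detect constant series before computing the correlation. The sketch below is only an illustration: the class and method names are made up, and returning `Double.NaN` for the undefined case is an assumption. You may prefer to drop such resources or put them into a cluster of their own.

```java
// Hypothetical helper class, not part of any library.
final class SafePearson {

    /** True if every value equals the first one, i.e. the series has zero variance. */
    static boolean isConstant(double[] x) {
        for (double v : x) {
            if (v != x[0]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Pearson correlation of two equally long series.
     * Returns Double.NaN when either series is constant, because the
     * coefficient is undefined in that case (division by zero variance).
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("series must be non-empty and of equal length");
        }
        if (isConstant(x) || isConstant(y)) {
            return Double.NaN;
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

Filtering with `isConstant` before building the N*N matrix keeps NaN values out of the clustering step entirely.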