I have a use case where I have traffic data sampled every 15 minutes for 1 month. This data is collected for various resources in a network.
Now I need to group resources which are similar (based on their traffic usage pattern from 00:00 to 23:45).
One way to check whether two resources have similar traffic behavior is to compute the Pearson correlation coefficient for every pair of resources and build an N*N matrix.
My question is: which method should I apply to cluster the similar resources? The existing K-Means clustering methods are based on Euclidean distance. Which algorithm can I use to cluster based on similarity of pattern?
Any thoughts or links to a possible solution are welcome. I want to implement this in Java.
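Pearson correlation is not compatible with the mean: k-means minimizes squared deviations from the cluster means, and the mean is not a meaningful cluster center under correlation. Thus, k-means must not be used - it is proper for least-squares, but not for correlation.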
Instead, just use hierarchical agglomerative clustering (HAC), which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can set a threshold: requiring a correlation of at least, e.g., +0.75 (a distance of 0.25 if you use 1 - r as the distance) may be a reasonable choice of epsilon. But to get a feeling for your distance function, dendrograms as used by HAC are probably easier.
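To make the agglomerative idea concrete, here is a minimal Java sketch, not a tuned implementation. It assumes the distance is 1 - r, uses average linkage, and stops merging once the closest pair of clusters is farther apart than a threshold. The class name CorrelationHac, the method names and the 0.25 cut-off are all made up for illustration. It is a naive loop that recomputes linkages on every pass, so it is only suitable for a modest number of resources.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: average-linkage HAC on a 1 - Pearson distance matrix. */
final class CorrelationHac {

    /** Pearson correlation of two equal-length series; NaN if either series is constant. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Symmetric matrix with d[i][j] = 1 - r(i, j); the diagonal stays 0. */
    static double[][] distanceMatrix(double[][] profiles) {
        int n = profiles.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                d[i][j] = d[j][i] = 1.0 - pearson(profiles[i], profiles[j]);
            }
        }
        return d;
    }

    /** Mean pairwise distance between the members of two clusters. */
    static double averageLinkage(List<Integer> a, List<Integer> b, double[][] dist) {
        double sum = 0;
        for (int i : a) {
            for (int j : b) sum += dist[i][j];
        }
        return sum / (a.size() * b.size());
    }

    /** Repeatedly merges the two closest clusters until none are within maxDistance. */
    static List<List<Integer>> cluster(double[][] dist, double maxDistance) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = averageLinkage(clusters.get(a), clusters.get(b), dist);
                    // NaN distances never satisfy this comparison, so they are never merged.
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            if (bestA < 0 || best > maxDistance) break;
            // bestA < bestB, so removing bestB does not shift the index of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}
```

Each inner list holds the row indices of the resources that ended up in the same cluster. In practice you would probably average each resource's 96 daily samples over the month first and pass those profiles to distanceMatrix, then call cluster(dist, 0.25) and adjust the threshold after looking at the merge distances.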
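Beware that Pearson is not defined for constant patterns, because their standard deviation is zero and the formula divides by it. If you have a resource with constant usage (e.g. 0 all the time), your distance will be undefined.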
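One way to deal with this, sketched below under my own assumptions, is to filter out constant series before computing any correlations and handle them separately (for example, as their own "flat" group). The class and method names are hypothetical, and the exact-equality test is a simplification; with noisy measurements you may prefer a small variance threshold instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: find series that are safe to use with Pearson correlation. */
final class ConstantProfileGuard {

    /** True if every sample equals the first one, i.e. the variance is zero. */
    static boolean isConstant(double[] series) {
        for (double v : series) {
            if (v != series[0]) return false;
        }
        return true;
    }

    /** Indices of the profiles that are not constant and can be correlated. */
    static List<Integer> usableIndices(double[][] profiles) {
        List<Integer> usable = new ArrayList<>();
        for (int i = 0; i < profiles.length; i++) {
            if (!isConstant(profiles[i])) usable.add(i);
        }
        return usable;
    }
}
```

Only the profiles at the returned indices would then go into the correlation matrix; the remaining ones never enter the distance computation.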