DBSCAN clusters of clusters (sklearn, Python)


I have elements of different categories that need to be clustered separately (according to their category) and then all together. Each element has a location (latitude,longitude).

My goal is to determine clusters (groups spanning different categories) of clusters (groups of elements in the same category), as in the following picture: https://i.stack.imgur.com/B5uej.png

In my case, the distance between two elements that should be included in a cluster is the same as the distance between two clusters that should be included in a cluster of clusters. Take the blue cluster in the picture, for example: since all the elements in it are separated by a distance of at most d (from any element of the cluster), they belong to the blue cluster. The same holds for the red cluster, which includes the elements that are separated by a distance of at most d.

With DBSCAN I can easily find the clusters over all of these elements if I provide all the elements together as input. If I want to find the clusters within each category, I have to provide only the elements of one category as input and run DBSCAN once per category. But I guess there should be something much faster than running DBSCAN many times to get these clusters of clusters.


There are 2 answers

Has QUIT--Anony-Mousse On

Why do you think it would be faster to mix categories that you want to be separate?

Do the cheap operations first, such as splitting your data set. Then process each partition independently.

As far as I know, scipy cannot accelerate geodetic distances, so you will have to do O(n^2) distance computations. If you have 10 categories of roughly equal size, each partition holds about n/10 points, so each DBSCAN run needs about 10^2 = 100 times fewer distance computations. Running DBSCAN 10 times, once per partition, is therefore about 10x faster overall than one run on the mixed data.
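The split-then-cluster advice above can be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn's built-in haversine metric is an acceptable stand-in for a geodetic distance; the helper names (cluster_latlon, cluster_per_category), the eps_km parameter, and the sample coordinates are made up for the example and do not come from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert km to radians


def cluster_latlon(latlon_deg, eps_km, min_samples=2):
    """Run DBSCAN on (lat, lon) pairs given in degrees, using haversine distance."""
    radians = np.radians(np.asarray(latlon_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are expressed in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(radians)


def cluster_per_category(latlon_deg, categories, eps_km, min_samples=2):
    """Split the data by category first, then cluster each partition on its own."""
    latlon_deg = np.asarray(latlon_deg, dtype=float)
    categories = np.asarray(categories)
    labels = np.full(len(categories), -1, dtype=int)
    for cat in np.unique(categories):
        mask = categories == cat
        labels[mask] = cluster_latlon(latlon_deg[mask], eps_km, min_samples)
    return labels  # cluster ids are only meaningful within a single category


# Hypothetical sample data: four nearby points in two categories, one far away.
points = [
    (48.8566, 2.3522),   # category "a"
    (48.8568, 2.3525),   # category "b"
    (48.8570, 2.3530),   # category "a"
    (48.8572, 2.3528),   # category "b"
    (40.4168, -3.7038),  # category "a", far from the rest
]
cats = ["a", "b", "a", "b", "a"]

per_category = cluster_per_category(points, cats, eps_km=1.0)  # one run per category
all_together = cluster_latlon(points, eps_km=1.0)              # one run on everything
```

Because the per-category labels restart at 0 in every partition, a label has to be read together with its category; label -1 marks points that DBSCAN treats as noise.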

Aramis7d On

It seems to me the main problem here comes from the multi-representation or hierarchical nature of your data (categories, and clusters within categories). Typically, if each distance is based on a single dimension, the two dimensions (say, cluster distance and category distance) could be combined into one new dimension in which the data representation becomes simpler.
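One possible reading of this suggestion, sketched here purely as an assumption, is to fold the category into the distance itself: take the geographic distance and add a large penalty whenever two elements belong to different categories, then run DBSCAN once on the precomputed matrix. The function name cluster_with_category_penalty, the penalty_km parameter, and the sample data are hypothetical. Note that this still builds the full n x n distance matrix, so it does not save any work compared with splitting the data first.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert radians to km


def cluster_with_category_penalty(latlon_deg, categories, eps_km,
                                  min_samples=2, penalty_km=50_000.0):
    """Cluster once, using geographic distance plus a cross-category penalty."""
    radians = np.radians(np.asarray(latlon_deg, dtype=float))
    dist_km = haversine_distances(radians) * EARTH_RADIUS_KM

    categories = np.asarray(categories)
    different = categories[:, None] != categories[None, :]

    # Assumption: penalty_km exceeds eps_km, so elements of different
    # categories can never be neighbours of each other.
    combined = dist_km + np.where(different, penalty_km, 0.0)

    model = DBSCAN(eps=eps_km, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(combined)


# Hypothetical sample data: four nearby points in two categories, one far away.
points = [
    (48.8566, 2.3522),   # category "a"
    (48.8568, 2.3525),   # category "b"
    (48.8570, 2.3530),   # category "a"
    (48.8572, 2.3528),   # category "b"
    (40.4168, -3.7038),  # category "a", far from the rest
]
cats = ["a", "b", "a", "b", "a"]

labels = cluster_with_category_penalty(points, cats, eps_km=1.0)
```

With this combined distance, nearby points of different categories end up in different clusters even though a single DBSCAN run sees all of them.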

Maybe this helps?

Some material I found that may be helpful: