DBSCAN in Hadoop


I don't know what the key and value for map() should be, or what the input and output formats should be. If map() reads one point at a time, how can that point's neighbors be computed, given that the remaining points have not been read yet?


There are 2 answers

Has QUIT--Anony-Mousse

DBSCAN is not an embarrassingly parallel algorithm.

It will not be trivial to represent it as map-reduce.

You will need to either employ some hacks (such as partitioning your data and mapping each point to its partition, as sketched below), or completely redesign the algorithm.
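For illustration, here is a minimal sketch of such a partitioning hack, assuming 2-D points stored as one CSV line each; the class name, grid scheme, and epsilon value are assumptions for the example, not something the answer prescribes. A mapper keys every point by a grid cell of side epsilon, so each reducer receives one spatial partition and can run a local DBSCAN on it:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: key each 2-D point by the grid cell it falls into,
// so every reducer receives one spatial partition and can run DBSCAN locally.
public class GridPartitionMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private static final double EPS = 0.5; // assumed DBSCAN epsilon

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] xy = line.toString().split(",");
        double x = Double.parseDouble(xy[0]);
        double y = Double.parseDouble(xy[1]);

        long cx = (long) Math.floor(x / EPS);
        long cy = (long) Math.floor(y / EPS);

        // Emit the point to its own cell and to all 8 neighboring cells;
        // a point can only be within EPS of points in those cells, so each
        // reducer sees every point relevant to its partition.
        for (long dx = -1; dx <= 1; dx++) {
            for (long dy = -1; dy <= 1; dy++) {
                ctx.write(new Text((cx + dx) + ":" + (cy + dy)), line);
            }
        }
    }
}
```

Replicating every point into all eight neighboring cells is wasteful; a real job would copy only points within EPS of a cell border, but the naive version keeps the sketch short.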

There are a number of articles on parallel DBSCAN. You will likely be able to run some of these in a map-reduce-like framework, or at least on a custom (non-map-reduce) YARN engine.

Viktor Tóth

Check this paper out: https://www.researchgate.net/publication/261212964_A_new_scalable_parallel_DBSCAN_algorithm_using_the_disjoint-set_data_structure

The following is my solution; it might be easier to understand than the one in the paper:

First, I would compute the distance matrix. It can be a sparse matrix containing only the distances that are less than the DBSCAN epsilon parameter; find a way to implement this in MapReduce.
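As a rough sketch of that step, a reducer paired with the grid mapper above could compute the pairwise distances within each cell and keep only the sparse entries below epsilon; again, the class name and the CSV point format are assumptions for the example:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: for each grid cell, compute pairwise distances and
// keep only the sparse entries below EPS. The output lines together form
// the sparse distance matrix described above.
public class SparseDistanceReducer
        extends Reducer<Text, Text, Text, Text> {

    private static final double EPS = 0.5; // assumed DBSCAN epsilon

    @Override
    protected void reduce(Text cell, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        // Materialize the (small) cell so points can be paired; a real job
        // would guard against oversized cells.
        List<double[]> pts = new ArrayList<>();
        List<String> raw = new ArrayList<>();
        for (Text v : values) {
            String[] xy = v.toString().split(",");
            pts.add(new double[] {
                Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) });
            raw.add(v.toString());
        }
        for (int i = 0; i < pts.size(); i++) {
            for (int j = i + 1; j < pts.size(); j++) {
                double dx = pts.get(i)[0] - pts.get(j)[0];
                double dy = pts.get(i)[1] - pts.get(j)[1];
                double d = Math.sqrt(dx * dx + dy * dy);
                if (d < EPS) {
                    ctx.write(new Text(raw.get(i) + " " + raw.get(j)),
                              new Text(Double.toString(d)));
                }
            }
        }
    }
}
```

Pairs whose points were replicated into several adjacent cells will be emitted more than once; a real job would add a deduplication pass.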

You can map that distance matrix to multiple machines and cluster the points locally. Note that parallelized clustering breaks up the input space, so a cluster id in one instance might correspond to a different id in another instance.

To remedy that, gather all core points in a reduce step, then check the neighbors of every core point in a map step (this doesn't have to be O(n^2); be clever about it). Whenever two core points from different partitions are close, emit an entry pairing their two cluster ids, and gather these id pairs in a reduce step. From the pairs, derive the correct global cluster ids, e.g. with a disjoint-set structure as sketched below.

The above description might sound a bit abstract, but it should give you the idea.
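To make the last step concrete, here is a minimal union-find (disjoint-set) sketch for collapsing the gathered id pairs into global cluster ids, in the spirit of the linked paper; the "partition:localCluster" id format is just an assumption for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the final merge step: given the gathered pairs of
// local cluster ids that share neighboring core points, a disjoint-set
// (union-find) collapses them into global cluster ids.
public class ClusterIdMerger {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = parent.get(id);
        if (!root.equals(id)) {
            root = find(root);
            parent.put(id, root); // path compression
        }
        return root;
    }

    public void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public String globalId(String localId) {
        return find(localId);
    }

    public static void main(String[] args) {
        ClusterIdMerger m = new ClusterIdMerger();
        // Id pairs gathered in the reduce step; "partition:localCluster" ids.
        m.union("p0:c1", "p1:c0");
        m.union("p1:c0", "p2:c3");
        // All three local ids now resolve to one global representative.
        System.out.println(m.globalId("p0:c1")); // prints p2:c3
        System.out.println(m.globalId("p2:c3")); // prints p2:c3
    }
}
```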