DBSCAN in Hadoop
1.5k views · Asked by Girjesh · 2 answers
Actually, I don't know what the key and value for map() should be, or what the input and output formats should be. If map() reads one point at a time, how can the neighbors of that point be computed, given that the remaining points have not been read yet?
Check this paper out: https://www.researchgate.net/publication/261212964_A_new_scalable_parallel_DBSCAN_algorithm_using_the_disjoint-set_data_structure
The following is my solution; it might be easier to understand than the one in the paper:
First I would compute the distance matrix. It can be a sparse matrix containing only those distances that are less than the DBSCAN epsilon parameter; find a way to implement this as a map-reduce job.
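Here is a rough Hadoop Streaming sketch of that first step. Everything concrete in it is my own assumption rather than part of the answer: the "id,x,y" input format, the epsilon value, and the trick of bucketing points into epsilon-sized grid cells so that every pair closer than epsilon meets in at least one reducer.

```python
#!/usr/bin/env python
# Sketch of the sparse-distance-matrix step as a Hadoop Streaming job.
# Assumed (not from the answer): input lines are "id,x,y", EPS is the
# DBSCAN epsilon, and the job is launched with something like
#   -mapper "eps_edges.py map" -reducer "eps_edges.py reduce"
import sys
import math
from itertools import groupby

EPS = 0.5  # assumed epsilon value


def mapper():
    # Emit each point under its own grid cell and the 8 neighbouring cells,
    # so that any two points within EPS of each other share at least one key.
    for line in sys.stdin:
        pid, x, y = line.strip().split(",")
        x, y = float(x), float(y)
        cx, cy = int(math.floor(x / EPS)), int(math.floor(y / EPS))
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                print("%d:%d\t%s,%f,%f" % (cx + dx, cy + dy, pid, x, y))


def reducer():
    # Hadoop delivers the records sorted by cell key, so groupby yields one
    # group per grid cell; compute pairwise distances inside each cell and
    # keep only the edges shorter than EPS.
    keyed = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for _cell, group in groupby(keyed, key=lambda kv: kv[0]):
        pts = []
        for _, val in group:
            pid, x, y = val.split(",")
            pts.append((pid, float(x), float(y)))
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d = math.hypot(pts[i][1] - pts[j][1], pts[i][2] - pts[j][2])
                if d < EPS:
                    # one entry of the sparse distance matrix
                    print("%s\t%s\t%f" % (min(pts[i][0], pts[j][0]),
                                          max(pts[i][0], pts[j][0]), d))


if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```

Because each point is replicated into nine cells, the same edge can come out of several reducer groups; a trivial dedup pass over the output cleans that up before the clustering step.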
You can then distribute that distance matrix across multiple machines and cluster the points locally. Note that parallelized clustering breaks up the input space, so a cluster id assigned in one instance may correspond to a different id in another instance.
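To make the "different instances produce different cluster ids" problem concrete, here is a sketch of what one such local clustering reducer could look like. The "partition_id<TAB>point_id,x,y" input format, the parameter values, and the use of scikit-learn's single-machine DBSCAN on the worker nodes are all my own assumptions; for brevity it clusters the raw points of a partition rather than the precomputed sparse distance matrix, and it assumes the mapper already replicated points near partition borders into the neighbouring partitions.

```python
#!/usr/bin/env python
# Sketch of the per-partition clustering step (reducer side).
# Assumed: lines arrive as "partition_id\tpoint_id,x,y", scikit-learn is
# installed on the workers, and boundary points were already replicated
# into neighbouring partitions by the mapper (the epsilon "halo").
import sys
from itertools import groupby

import numpy as np
from sklearn.cluster import DBSCAN

EPS = 0.5          # assumed epsilon
MIN_SAMPLES = 5    # assumed minPts


def reducer():
    keyed = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for part, group in groupby(keyed, key=lambda kv: kv[0]):
        ids, coords = [], []
        for _, val in group:
            pid, x, y = val.split(",")
            ids.append(pid)
            coords.append((float(x), float(y)))
        model = DBSCAN(eps=EPS, min_samples=MIN_SAMPLES).fit(np.array(coords))
        core = set(model.core_sample_indices_)
        for i, label in enumerate(model.labels_):
            if label == -1:
                continue  # local noise; another partition may still claim it
            # "partition:label" keeps ids unique across partitions, but two
            # such ids may still denote the same real cluster -- hence the
            # merge step described below.
            print("%s\t%s:%d\t%s" % (ids[i], part, label,
                                     "core" if i in core else "border"))


if __name__ == "__main__":
    reducer()
```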
To remedy that, gather all core points in a reduce step, then check each neighbor of every core point (a map step; it doesn't have to be O(n^2) if you are clever about it). Whenever you find another core point within epsilon that belongs to a different local cluster, emit the pair of the two cluster ids; gather these id pairs (reduce). From these pairs, derive the correct global cluster ids.
The above description might sound a bit abstract, but it should give you the idea.
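As a sketch of that last merge, assuming the previous reduce step wrote one "local_id_a<TAB>local_id_b" pair per line for every pair of local clusters whose core points fall within epsilon of each other (the format is my invention), a small union-find pass produces the global ids. The pair list is usually tiny, so it can run on a single machine or as one final reducer:

```python
#!/usr/bin/env python
# Sketch of the final merge: pairs of local cluster ids -> global cluster ids.
# Assumed input: one "local_id_a\tlocal_id_b" pair per line.
import sys

parent = {}


def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb


for line in sys.stdin:
    a, b = line.strip().split("\t")
    union(a, b)

# Each local id maps to the representative of its connected component, which
# acts as the global cluster id. Local ids that never appear in any pair were
# never merged and simply keep their own id.
for local_id in sorted(parent):
    print("%s\t%s" % (local_id, find(local_id)))
```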
DBSCAN is not an embarrassingly parallel algorithm.
It will not be trivial to represent it as map-reduce.
You will need to either employ some hacks (such as partitioning your data and mapping each value to its partition), or completely redesign the algorithm.
There are a number of articles on parallel DBSCAN. You will likely be able to run some of these in a map-reduce-like framework, or at least on a custom (non-map-reduce) YARN engine.