How is the generalized DBSCAN (GDBSCAN) in ELKI implemented in Java/Scala? I am trying to find an efficient way to implement a weighted DBSCAN in ELKI, to offset the inefficiencies of the sklearn implementation of weighted DBSCAN.
The reason I am doing this is that sklearn simply cannot cope with DBSCAN on terabyte-scale datasets (which I am processing in the cloud).
For example, I have written the following code: a database creation function, and a DBSCAN function that reads an array of arrays and prints the size and member indices of each resulting cluster.
/* Libraries imported from the ELKI library - https://elki-project.github.io/releases/current/doc/overview-summary.html */
import de.lmu.ifi.dbs.elki.data.model.Model
import de.lmu.ifi.dbs.elki.data.{Clustering, DoubleVector, NumberVector}
import de.lmu.ifi.dbs.elki.database.{Database, StaticArrayDatabase}
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.EuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.distance.distancefunction.NumberVectorDistanceFunction
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
// Imports for generalized DBSCAN
import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan // Package with the generalized DBSCAN machinery needed for weighted DBSCAN
import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.CorePredicate // The predicate to customize for a weighted variant
import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN
import de.lmu.ifi.dbs.elki.utilities.ELKIBuilder
import de.lmu.ifi.dbs.elki.database.relation.Relation
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection
import de.lmu.ifi.dbs.elki.database.ids.{DBIDIter, DBIDUtil}
import scala.collection.JavaConverters._ // Required for .asScala on ELKI's Java collections
import de.lmu.ifi.dbs.elki.index.tree.metrical.covertree.SimplifiedCoverTree
import de.lmu.ifi.dbs.elki.data.{`type` => TYPE} // 'type' is a reserved keyword in Scala, so the package must be renamed on import
import de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeFactory // Alternative spatial index option
def createDatabaseWeighted(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector]): Database = {
  // Index factory to accelerate the range queries issued by DBSCAN
  val indexFactory = new SimplifiedCoverTree.Factory[NumberVector](distanceFunction, 0, 30)
  // Create a database backed by the in-memory array
  val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), java.util.Arrays.asList(indexFactory))
  // Load the data into the database; getRelation will fail if this is skipped
  db.initialize()
  db
}
def dbscanClusteringOriginalTest(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector] = SquaredEuclideanDistanceFunction.STATIC, epsilon: Double = 10, minpts: Int = 10) = {
  // Use the same `distanceFunction` for the database index and for DBSCAN,
  // otherwise the index cannot be used to accelerate the range queries
  val db = createDatabaseWeighted(data, distanceFunction)
  val rel = db.getRelation(TYPE.TypeUtil.NUMBER_VECTOR_FIELD) // The vector relation (DBSCAN can also fetch this from the database itself)
  val dbscan = new DBSCAN[DoubleVector](distanceFunction, epsilon, minpts) // epsilon and minpts come from the arguments, or the defaults above
  val result: Clustering[Model] = dbscan.run(db)
  var clusterCounter = 0 // Number of partitions (clusters plus the noise set) produced by DBSCAN
  result.getAllClusters.asScala.zipWithIndex.foreach { case (cluster, idx) =>
    // getNameAutomatic returns "Cluster" for ordinary clusters and "Noise" for the noise partition
    if (cluster.getNameAutomatic == "Cluster" || cluster.getNameAutomatic == "Noise") {
      clusterCounter += 1
      println(s"# $idx: ${cluster.getNameAutomatic}")
      println(s"Size: ${cluster.size()}")
      println(s"Model: ${cluster.getModel}")
      println(s"ids: ${DBIDUtil.toString(cluster.getIDs)}") // Format the member DBIDs (the raw iterator's toString is not meaningful)
    }
  }
}
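For reference, I call this on an in-memory array of points (toy data, just for illustration):

val points = Array(Array(1.0, 1.1), Array(0.9, 1.0), Array(50.0, 50.2), Array(49.8, 50.1))
dbscanClusteringOriginalTest(points, epsilon = 1.0, minpts = 2)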
I can get this to run quite efficiently, but I am struggling to get a similar effect with GeneralizedDBSCAN. An earlier answer suggested that weighting could be done by modifying the CorePredicate in ELKI (the analogue of the sample_weight option in sklearn's DBSCAN), but I am not sure how this should be implemented.
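As far as I can tell from the javadoc, plain DBSCAN should correspond to GeneralizedDBSCAN with the stock predicates, roughly like this (untested sketch, class names per the 0.7.x javadoc):

import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.{EpsilonNeighborPredicate, GeneralizedDBSCAN, MinPtsCorePredicate}

val gdbscan = new GeneralizedDBSCAN(
  new EpsilonNeighborPredicate(epsilon, distanceFunction), // eps-range neighborhoods, as in plain DBSCAN
  new MinPtsCorePredicate(minpts),                         // core point iff at least minpts neighbors
  false)                                                   // don't keep a core-point model
val result = gdbscan.run(db)

but I do not see where a per-point weight could be plugged in.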
Any pointers would be highly appreciated!
Implement your own GDBSCAN core predicate: rather than counting neighbors as the standard implementation (MinPtsCorePredicate) does, add up their weights. Then you have weighted DBSCAN; see the sketch below.
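A minimal sketch of such a predicate in Scala, modelled on MinPtsCorePredicate. It assumes the ELKI 0.7.5 API, where CorePredicate[T] has instantiate(Database) and a nested Instance[T] with isCorePoint(DBIDRef, T) (older 0.7.x releases pass a type argument to instantiate instead, so adjust to your version). It also assumes the database was built with ArrayAdapterDatabaseConnection, so the DBIDs form a contiguous DBIDRange whose offsets match the rows of the input array. WeightedCorePredicate, weights and minWeight are illustrative names, not ELKI API:

import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.{CorePredicate, EpsilonNeighborPredicate, GeneralizedDBSCAN}
import de.lmu.ifi.dbs.elki.data.`type`.{SimpleTypeInformation, TypeUtil}
import de.lmu.ifi.dbs.elki.database.Database
import de.lmu.ifi.dbs.elki.database.ids.{DBIDRange, DBIDRef, DBIDs}

// A point is a core point iff the summed weight of its neighborhood reaches
// minWeight; with all weights equal to 1 this is exactly the minpts count.
class WeightedCorePredicate(weights: Array[Double], minWeight: Double) extends CorePredicate[DBIDs] {

  override def instantiate(database: Database): CorePredicate.Instance[DBIDs] = {
    // ArrayAdapterDatabaseConnection assigns sequential DBIDs, so the
    // relation's DBIDs form a DBIDRange and getOffset recovers the row
    // index of each DBID in the original data (and weights) array.
    val ids = database.getRelation(TypeUtil.NUMBER_VECTOR_FIELD).getDBIDs.asInstanceOf[DBIDRange]
    new CorePredicate.Instance[DBIDs] {
      override def isCorePoint(point: DBIDRef, neighbors: DBIDs): Boolean = {
        var sum = 0.0
        val iter = neighbors.iter()
        while (iter.valid()) { // Sum neighbor weights instead of counting neighbors
          sum += weights(ids.getOffset(iter))
          iter.advance()
        }
        sum >= minWeight
      }
    }
  }

  // Accept the same neighborhood types as MinPtsCorePredicate
  override def acceptsType(tpe: SimpleTypeInformation[_ <: DBIDs]): Boolean =
    TypeUtil.DBIDS.isAssignableFromType(tpe) || TypeUtil.NEIGHBORLIST.isAssignableFromType(tpe)
}

Wired together with the stock eps-range neighbor predicate, the weighted variant would then run as:

val gdbscan = new GeneralizedDBSCAN(
  new EpsilonNeighborPredicate(epsilon, distanceFunction), // unchanged neighborhood definition
  new WeightedCorePredicate(weights, minWeight),           // weight sum replaces the minpts count
  false)
val result: Clustering[Model] = gdbscan.run(db)

With all weights set to 1.0 and minWeight = minpts this should reproduce the plain DBSCAN result, which is a useful sanity check.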