Clustering data using DBSCAN and spark_sklearn


I want to cluster my input data using DBSCAN and spark_sklearn. I'd like to get the labels of each input instance after clustering. Is it possible?

Reading the documentation on http://pythonhosted.org/spark-sklearn, I tried the following:

# temp_data: a Spark DataFrame with 'key' and 'features' columns,
#            where 'features' is a Vector.
from sklearn.cluster import DBSCAN
from spark_sklearn import KeyedEstimator

ke = KeyedEstimator(sklearnEstimator=DBSCAN(), estimatorType="clusterer")
print(ke.getOrDefault("estimatorType"))  # --> "clusterer"

ke.fit_predict(temp_data)  # --> ERROR: 'KeyedEstimator' object has no attribute 'fit_predict'

k_model = ke.fit(temp_data)
print(k_model.getOrDefault("estimatorType"))  # --> "clusterer"

k_model.fit_predict(temp_data)  # --> ERROR: 'KeyedModel' object has no attribute 'fit_predict'

k_model.predict(temp_data)  # --> ERROR: 'KeyedModel' object has no attribute 'predict'

k_model.transform(temp_data)  # --> ERROR: estimatorType assumed to be a clusterer, but sklearnEstimator is missing fit_predict()
(NOTE: sklearn.cluster.DBSCAN actually has a fit_predict() method)

What I normally do with plain sklearn (without Spark) is fit the model on the features (dbscan_model.fit(features)) and read the labels from the fitted model (labels = dbscan_model.labels_). Getting the 'labels_' attribute through spark-sklearn would also be fine.
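For reference, that plain scikit-learn workflow looks roughly like the sketch below. The toy data and the eps/min_samples values are made up for illustration. In scikit-learn itself, labels_[i] is the label of the i-th row passed to fit(), and -1 marks noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up for illustration): two tight groups and one far outlier.
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
    [8.0, 8.0], [8.1, 8.0], [7.9, 8.1],
    [50.0, 50.0],
])

# eps and min_samples are illustrative values chosen for this toy data.
dbscan_model = DBSCAN(eps=0.5, min_samples=2).fit(X)

# One label per input row, in the same order as the rows of X; -1 means noise.
labels = dbscan_model.labels_

for point, label in zip(X, labels):
    print(point, label)
```

Pairing rows and labels by position with zip() is enough here because everything stays in one local array.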

If the above-mentioned calls ('transform' or 'predict') don't work, is it possible to get 'labels_' after fitting the data with spark-sklearn? How can I do that? And assuming I have obtained 'labels_', how can I map the input instances to the labels? Are they in the same order?


There are 2 answers

user2737636 On

I've managed to get the 'labels_' attribute; however, I still don't know whether the resulting labels are in the same order as the input instances.

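# temp_data: a Spark DataFrame with 'key' and 'features' columns,
#            where 'features' is a Vector.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
from sklearn.cluster import DBSCAN
from spark_sklearn import KeyedEstimator

ke = KeyedEstimator(sklearnEstimator=DBSCAN())
k_model = ke.fit(temp_data)

def getLabels(model):
    # 'model' is a value from the "estimator" column; the fitted DBSCAN is its .estimator
    return model.estimator.labels_

# One output row per key, holding that key's labels as a list of ints
labels_udf = udf(lambda x: getLabels(x).tolist(), ArrayType(IntegerType()))("estimator").alias("labels")
res_df = k_model.keyedModels.select("key", labels_udf)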
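Since the ordering question is still open, one way to avoid depending on it is to carry an explicit row identifier next to the features and attach each label to that identifier. The sketch below does this in plain scikit-learn, without Spark, fitting one DBSCAN per key. The helper name cluster_per_key, the row_id field, the toy rows, and the eps/min_samples values are all hypothetical and not part of spark-sklearn.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN


def cluster_per_key(rows, eps=0.5, min_samples=2):
    """Fit one DBSCAN per key and return {(key, row_id): label}.

    rows is an iterable of (key, row_id, features) tuples; row_id is a
    hypothetical identifier that is unique within each key.
    """
    groups = defaultdict(list)
    for key, row_id, features in rows:
        groups[key].append((row_id, features))

    result = {}
    for key, members in groups.items():
        ids = [row_id for row_id, _ in members]
        X = np.array([features for _, features in members])
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        # labels[i] belongs to X[i], so zipping with ids keeps the mapping explicit.
        for row_id, label in zip(ids, labels):
            result[(key, row_id)] = int(label)
    return result


# Toy rows (made up for illustration).
rows = [
    ("a", 0, [1.0, 1.0]), ("a", 1, [1.1, 1.0]), ("a", 2, [9.0, 9.0]),
    ("b", 0, [5.0, 5.0]), ("b", 1, [5.1, 5.0]),
]
labels_by_row = cluster_per_key(rows)
print(labels_by_row)
```

Because each label is stored under its (key, row_id) pair, the result can be joined back to the input without assuming anything about row order.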
eliasah On

This is only possible for a clusterer such as KMeans, where cluster labels can be predicted for given data because the scikit-learn estimator provides that functionality (a predict() method).

Unfortunately, this is not the case for some other clusterers, such as DBSCAN.
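A quick way to see the difference is to check which methods each scikit-learn estimator exposes. This is only a sketch of the scikit-learn side; it does not show how spark-sklearn itself decides which method to call.

```python
from sklearn.cluster import DBSCAN, KMeans

# For each estimator, record (has predict, has fit_predict).
support = {
    type(estimator).__name__: (
        hasattr(estimator, "predict"),
        hasattr(estimator, "fit_predict"),
    )
    for estimator in (KMeans(), DBSCAN())
}

for name, (has_predict, has_fit_predict) in support.items():
    print(name, "predict:", has_predict, "fit_predict:", has_fit_predict)
```

KMeans exposes both methods, while DBSCAN exposes fit_predict() but not predict(), so it can label the data it was fitted on but not new data.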