Clustering data using DBSCAN and spark_sklearn

Question

3.1k views Asked by user2737636 At 03 January 2017 at 09:25

I want to cluster my input data using DBSCAN and spark_sklearn. I'd like to get the labels of each input instance after clustering. Is it possible?

Reading the documentation on http://pythonhosted.org/spark-sklearn, I tried the following:

from sklearn.cluster import DBSCAN
from spark_sklearn import KeyedEstimator

# temp_data: Spark DataFrame with 'key' and 'features' columns,
#            where 'features' is a Vector.

ke = KeyedEstimator(sklearnEstimator=DBSCAN(), estimatorType="clusterer")
print(ke.getOrDefault("estimatorType"))  # --> "clusterer"

ke.fit_predict(temp_data)  # ERROR: 'KeyedEstimator' object has no attribute 'fit_predict'

k_model = ke.fit(temp_data)
print(k_model.getOrDefault("estimatorType"))  # --> "clusterer"

k_model.fit_predict(temp_data)  # ERROR: 'KeyedModel' object has no attribute 'fit_predict'

k_model.predict(temp_data)  # ERROR: 'KeyedModel' object has no attribute 'predict'

k_model.transform(temp_data)  # ERROR: estimatorType assumed to be a clusterer, but sklearnEstimator is missing fit_predict()
(NOTE: sklearn.cluster.DBSCAN actually has a fit_predict() method)

What I normally do with sklearn (without Spark) is fit the model on the features of temp_data (dbscan_model.fit(...)) and then read the labels from the fitted model (labels = dbscan_model.labels_). Getting the 'labels_' attribute through spark-sklearn would also be fine.
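For reference, a minimal sketch of the plain scikit-learn workflow described above. The toy array X and the eps and min_samples values are made up for illustration and are not from the original data. In plain scikit-learn, labels_[i] is the label of the i-th row passed to fit(), and -1 marks noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight groups plus one far-away outlier.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# eps and min_samples are illustrative values chosen for this toy data.
dbscan_model = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = dbscan_model.labels_

# labels[i] belongs to X[i]; a label of -1 means the point is noise.
for point, label in zip(X, labels):
    print(point, label)
```

This only shows the ordering guarantee for plain scikit-learn. Whether spark-sklearn keeps the rows of each key in their original order before fitting is a separate question that this sketch does not answer.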

If the above-mentioned calls ('transform' or 'predict') don't work, is it possible to get 'labels_' after fitting the data with spark-sklearn? How can I do that? And once I have 'labels_', how can I map the input instances to the labels? Are they in the same order?

There are 2 answers

**user2737636** · Answer 1 · 2017-01-03T12:58:38+00:00

I've managed to get the 'labels_' attribute; however, I still don't know whether the resulting labels are in the same order as the input instances.

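from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
from sklearn.cluster import DBSCAN
from spark_sklearn import KeyedEstimator

# temp_data: Spark DataFrame with 'key' and 'features' columns,
#            where 'features' is a Vector.

ke = KeyedEstimator(sklearnEstimator=DBSCAN())
k_model = ke.fit(temp_data)

def getLabels(model):
    return model.estimator.labels_

# One row per key, holding the labels_ array of that key's fitted DBSCAN
labels_udf = udf(lambda x: getLabels(x).tolist(), ArrayType(IntegerType()))("estimator").alias("labels")
res_df = k_model.keyedModels.select("key", labels_udf)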

**eliasah** · Answer 2 · 2017-01-03T09:30:23+00:00

This is only possible for clusterers such as KMeans, where cluster labels can be predicted because the scikit-learn estimator provides that functionality.

Unfortunately, this is not the case for some other clusterers, such as DBSCAN.
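A quick way to see the difference is to check which methods each scikit-learn estimator exposes. This is a small sketch using only scikit-learn itself; the `supports` dictionary is a hypothetical helper name, and the check says nothing about how spark-sklearn inspects the estimator internally.

```python
from sklearn.cluster import DBSCAN, KMeans

# Record which labelling methods each estimator class exposes.
supports = {}
for estimator in (KMeans(n_clusters=2), DBSCAN()):
    name = type(estimator).__name__
    supports[name] = {
        "fit_predict": hasattr(estimator, "fit_predict"),
        "predict": hasattr(estimator, "predict"),
    }

for name, methods in supports.items():
    print(name, methods)
```

KMeans offers both fit_predict() and predict(), so a fitted model can label new rows. DBSCAN offers fit_predict() only, so labels exist just for the rows it was fitted on.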
