I want to cluster my input data using DBSCAN and spark_sklearn. I'd like to get the labels of each input instance after clustering. Is it possible?
Reading the documentation on http://pythonhosted.org/spark-sklearn, I tried the following:
temp_data = Spark DataFrame containing 'key' and 'features' columns,
where 'features' is a Vector.
ke = KeyedEstimator(sklearnEstimator=DBSCAN(), estimatorType="clusterer")
print ke.getOrDefault("estimatorType") --> "clusterer"
ke.fit_predict(temp_data) --> ERROR: 'KeyedEstimator' object has no attribute 'fit_predict'
k_model = ke.fit(temp_data)
print k_model.getOrDefault("estimatorType") --> "clusterer"
k_model.fit_predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'fit_predict'
k_model.predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'predict'
k_model.transform(temp_data) --> ERROR: estimatorType assumed to be a clusterer, but sklearnEstimator is missing fit_predict()
(NOTE: sklearn.cluster.DBSCAN actually does have a fit_predict() method)
What I normally do using sklearn (without spark) is to fit (dbscan_model.fit(temp_data-features)) and get labels from the model (labels = dbscan_model.labels_). It is also fine if I can get the 'labels_' attribute using spark-sklearn.
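For reference, here is a minimal sketch of the plain scikit-learn workflow described above. The toy `features` array, the `keys` list, and the `eps`/`min_samples` values are made up for illustration and are not my real data. In plain scikit-learn, `labels_[i]` is the label of the i-th input row, so pairing keys with labels by position works there; whether spark-sklearn preserves that ordering is exactly what I am unsure about.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one far-away point.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])
# Hypothetical row identifiers, standing in for the 'key' column.
keys = ["a", "b", "c", "d", "e", "f", "g"]

# Parameter values are assumptions chosen to suit the toy data.
dbscan_model = DBSCAN(eps=0.5, min_samples=2).fit(features)
labels = dbscan_model.labels_

# labels has one entry per input row, in input order; -1 marks noise.
key_to_label = dict(zip(keys, labels.tolist()))
print(key_to_label)
```

With this toy data, the first three keys share one cluster, the next three share another, and "g" is labelled as noise (-1).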
If the above-mentioned calls ('transform' or 'predict') don't work, is it possible to get the 'labels_' after fitting the data using spark-sklearn? How can I do that? Assuming that we have obtained the 'labels_', how can I map the input instances to them? Are they in the same order?
I've managed to get the 'labels_' attribute; however, I still don't know whether the resulting labels are in the same order as the input instances.