There are similar questions and libraries like ELI5 and LIME, but I couldn't find a solution to my problem. I have a set of documents and I am trying to cluster them with scikit-learn's DBSCAN. First, I use TfidfVectorizer to vectorize the documents; then I simply cluster the data and receive the predicted labels. My question is: how can I explain why a cluster has formed? I mean, imagine there are 2 predicted clusters (cluster 1 and cluster 2). Which features (since our input data is vectorized documents, our features are vectorized "words") are important for the creation of cluster 1 (or cluster 2)?
Below you can find a minimal example of what I am currently working on. This is not a minimal working example of what I am trying to achieve (since I don't know how).
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# restrict the dataset to four newsgroups so the example stays small
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
# quick look at the raw documents next to their target labels
visualize_train_data = pd.DataFrame(
    data=np.c_[twenty_train['data'], twenty_train['target']])
print(visualize_train_data.head())
vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
vectorized_train_data = vec.fit_transform(twenty_train.data)
clustering = DBSCAN(eps=0.6, min_samples=2).fit(vectorized_train_data)
print(f"Unique labels are {np.unique(clustering.labels_)}")
Side notes: The question I mentioned focuses specifically on the k-means algorithm, and its answer isn't very intuitive (for me). ELI5 and LIME are great libraries, but the examples they provide are either regression- or classification-related (not clustering), and their regressors and classifiers support "predict" directly. DBSCAN doesn't...
First, let's understand the embedding space you are working with. TfidfVectorizer creates a very sparse matrix, one dimension of which corresponds to the sentences and the other to your vocabulary (all the words in the text, except stop words and very uncommon ones; see the min_df and stop_words parameters). When you ask DBSCAN to cluster the sentences, it takes those tf-idf representations and finds sentences that are close to each other under the Euclidean distance metric. So your clusters should hopefully consist of sentences that share common words. To find which words (or "features") are most important in a specific cluster, just take the sentences that belong to that cluster (rows of the matrix) and find the top K (say ~10) indices of the columns that have the most non-zero values in common. Then look up what those words are using vec.get_feature_names().
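For example, here is a minimal sketch of that lookup, reusing the vec, vectorized_train_data and clustering objects from your snippet (the top_terms_per_cluster helper is just an illustrative name, and get_feature_names_out() is the newer spelling of get_feature_names() in recent scikit-learn versions):

import numpy as np

def top_terms_per_cluster(tfidf_matrix, labels, feature_names, top_k=10):
    # For each cluster, rank the vocabulary columns by how many of the
    # cluster's rows (sentences) have a non-zero tf-idf entry there.
    top_terms = {}
    for label in np.unique(labels):
        if label == -1:  # -1 is DBSCAN's noise label, not a real cluster
            continue
        rows = tfidf_matrix[labels == label]  # sentences of this cluster
        counts = rows.getnnz(axis=0)          # non-zero entries per column
        top_idx = np.argsort(counts)[::-1][:top_k]
        top_terms[label] = [feature_names[i] for i in top_idx]
    return top_terms

feature_names = vec.get_feature_names_out()  # vec.get_feature_names() on older versions
for label, terms in top_terms_per_cluster(
        vectorized_train_data, clustering.labels_, feature_names).items():
    print(f"cluster {label}: {terms}")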
Update:
Please note that the clusters you get here are really small. Cluster 55 has only 4 sentences; most of the others have only 2.
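You can check that yourself by counting the label frequencies, for example:

import numpy as np

# -1 is noise; every other label is a cluster id with its member count
labels, counts = np.unique(clustering.labels_, return_counts=True)
for label, count in zip(labels, counts):
    print(f"cluster {label}: {count} sentences")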