How to explain text clustering result by feature importance? (DBSCAN)


There are similar questions, and libraries like ELI5 and LIME, but I couldn't find a solution to my problem. I have a set of documents and I am trying to cluster them using scikit-learn's DBSCAN. First, I use TfidfVectorizer to vectorize the documents. Then I simply cluster the data and receive the predicted labels. My question is: how can I explain why a cluster has formed? I mean, imagine there are 2 predicted clusters (cluster 1 and cluster 2). Which features (since our input data is vectorized documents, our features are vectorized "words") are important for the creation of cluster 1 (or cluster 2)?

Below you can find a minimal example of what I am currently working on. It is not a working example of what I am trying to achieve (since I don't know how to achieve it).

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)

visualize_train_data = pd.DataFrame(
    data=np.c_[twenty_train['data'], twenty_train['target']]
)
print(visualize_train_data.head())

vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
vectorized_train_data = vec.fit_transform(twenty_train.data)

clustering = DBSCAN(eps=0.6, min_samples=2).fit(vectorized_train_data)
print(f"Unique labels are {np.unique(clustering.labels_)}")

Side notes: the question I mentioned focuses specifically on the k-Means algorithm, and its answer isn't very intuitive (to me). ELI5 and LIME are great libraries, but their examples are all regression- or classification-related (not clustering), and their regressors and classifiers support "predict" directly. DBSCAN doesn't...


There are 2 answers

igrinis

First, let's understand the embedding space you are working with. TfidfVectorizer creates a very sparse matrix in which one dimension corresponds to the documents and the other to your vocabulary (all the words in the text, except stop words and very uncommon ones; see the min_df and stop_words parameters). When you ask DBSCAN to cluster the documents, it takes those tf-idf representations and finds documents that are close to each other under the Euclidean distance metric. So your clusters should, hopefully, be formed from documents that share common words. To find which words (or "features") matter most in a specific cluster, just take the documents that belong to that cluster (rows of the matrix) and find the top K (say ~10) column indices that are non-zero in the most documents. Then look up what those words are using vec.get_feature_names()

Update:

cluster_id = 55  # select some cluster
mask = clustering.labels_ == cluster_id
# In how many of the cluster's documents each feature is non-zero
feat_freq = (vectorized_train_data[mask] > 0).astype(int).sum(axis=0)
max_idx = np.argwhere(feat_freq == feat_freq.max())[:, 1]  # columns with the maximal frequency
feature_names = vec.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
for i in max_idx:
    print(i, feature_names[i])
    

Please note that the clusters you get here are really small: cluster 55 has only 4 documents, and most of the others have only 2.
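
The snippet above prints only the features tied at the maximal frequency. Here is a minimal sketch of the full top-K selection described above, assuming the variables from this answer (K = 10 is an assumption, not a value from the original):

K = 10  # assumed cutoff for "top K (say ~10)"
cluster_id = 55
mask = clustering.labels_ == cluster_id
# Per-feature document frequency within the cluster, flattened to a 1-D array
feat_freq = np.asarray((vectorized_train_data[mask] > 0).sum(axis=0)).ravel()
top_k = np.argsort(feat_freq)[::-1][:K]  # indices of the K most frequent features
feature_names = vec.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
for i in top_k:
    print(i, feature_names[i], feat_freq[i])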

nad_rom

DBSCAN, like most clustering algorithms in sklearn, doesn't provide a predict method or feature importances. So you can either (1) reconstruct the decision process by training a logistic regression or some other interpretable classifier on the cluster labels (as sketched below), or (2) switch to another text clustering method with interpretable components, such as NMF or LDA. The first approach is exactly what LIME and the like do.
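
A minimal sketch of approach (1), assuming the variables from the question (vectorized_train_data, clustering, vec); the choice of LogisticRegression and the top-10 cutoff are illustrative assumptions, not part of the original answer:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Drop DBSCAN noise points (label -1) before fitting the surrogate
mask = clustering.labels_ != -1
X, y = vectorized_train_data[mask], clustering.labels_[mask]

# Interpretable surrogate classifier trained to reproduce the cluster labels
surrogate = LogisticRegression(max_iter=1000)
surrogate.fit(X, y)

feature_names = np.array(vec.get_feature_names())  # get_feature_names_out() on newer scikit-learn
# Assumes more than two clusters, so coef_ has one row of weights per cluster
for label, coefs in zip(surrogate.classes_, surrogate.coef_):
    top = np.argsort(coefs)[::-1][:10]  # highest-weight features for this cluster
    print(f"cluster {label}: {feature_names[top]}")

For approach (2), the components_ attribute of a fitted NMF or LatentDirichletAllocation model exposes per-topic feature weights directly.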