Use similarity matrix instead of dissimilarity matrix for MDS in scikit-learn


I want to visualize the similarity of text documents, for which I am using scikit-learn's TfidfVectorizer as tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)

and then computing the cosine similarity as cosine_similarity = (tfidf*tfidf.T).toarray().
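For reference, here is a minimal runnable version of what I am doing (the documents in data below are just toy examples):

from sklearn.feature_extraction.text import TfidfVectorizer

data = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs and mats"]

tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)
cosine_similarity = (tfidf * tfidf.T).toarray()

# the diagonal is close to 1.0 but not exactly 1.0 (floating point),
# which is why 1 - cosine_similarity has small nonzero diagonal entries
print(cosine_similarity.diagonal())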

This gives a similarity matrix, but sklearn.manifold.MDS needs a dissimilarity matrix. When I compute 1 - cosine_similarity, the diagonal values, which should be zero, are not exactly zero; they are small values like 1.12e-9. Two questions:

1) How do I use a similarity matrix with MDS, or how do I convert my similarity matrix into a dissimilarity matrix?

2) MDS has a dissimilarity option whose value can be 'precomputed' or 'euclidean'. What is the difference between the two? When I use 'euclidean', the MDS coordinates come out the same regardless of whether I pass cosine_similarity or 1 - cosine_similarity, which looks wrong.

Thanks!


1 Answer

Answered by Jojo:

I do not really understand your cosine transformation (I see no cosine/angle/normalized scalar product being involved), and I do not know the TfidfVectorizer functionality, but I will try to answer your two questions:

1) Generally, the (dissimilarity = 1 - similarity) approach is valid when all entries of the matrix are between -1 and 1. Assuming d = cosine_similarity is such a symmetric similarity matrix up to numerical artefacts, you can apply

import numpy as np
dissimilarity_clean = np.triu(1 - d, 1) + np.triu(1 - d, 1).T  # mirror the strict upper triangle of 1 - d; the diagonal is exactly zero

to correct for the artefacts. The same cleanup can be needed when using numpy's corrcoef(X) to create a dissimilarity matrix based on Pearson correlation coefficients. Two side notes: 1. For unbounded similarity measures you can still come up with equivalent approaches. 2. When the result is used for MDS, you might consider a measure that is closer to a euclidean distance (and not bounded), as this is a more natural choice for MDS and tends to give better results; see the sketch below.
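For example, a minimal sketch of that idea, assuming the tfidf matrix from the question:

from sklearn.metrics.pairwise import euclidean_distances

# pairwise euclidean distances between the tf-idf row vectors;
# the result is symmetric and scikit-learn sets its diagonal exactly to zero,
# so it can be passed straight to MDS(dissimilarity='precomputed')
euclidean_dissimilarity = euclidean_distances(tfidf)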

2) With dissimilarity='precomputed', MDS expects you to pass a dissimilarity matrix that you computed yourself to .fit(X) (your scenario). With dissimilarity='euclidean', MDS instead treats X as a data matrix and computes the euclidean distance matrix of its rows itself. That also explains your observation: with 'euclidean', each row of the matrix you pass is treated as a feature vector, and the rows of 1 - cosine_similarity differ from the rows of cosine_similarity only by a sign flip and a constant shift, so the pairwise euclidean distances, and hence the MDS coordinates, are identical.
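In code, assuming the dissimilarity_clean matrix from above and the tfidf matrix from the question, the two modes look roughly like this:

from sklearn.manifold import MDS

# your scenario: feed a precomputed dissimilarity matrix
coords_pre = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(dissimilarity_clean)

# alternative: pass feature vectors and let MDS compute euclidean distances itself
coords_euc = MDS(n_components=2, dissimilarity='euclidean', random_state=0).fit_transform(tfidf.toarray())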

Hope this helps!