I want to visualize the similarity of text documents. I am using scikit-learn's TfidfVectorizer as tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data) and then computing cosine similarity as cosine_similarity = (tfidf*tfidf.T).toarray(), which gives a similarity matrix, but sklearn.manifold.MDS needs a dissimilarity matrix. When I pass 1-cosine_similarity, the diagonal values, which should be zero, are not zero. They are small values like 1.12e-9. Two questions:
1) How do I use a similarity matrix for MDS, or how do I convert my similarity matrix into a dissimilarity matrix?
2) In MDS, there is an option dissimilarity, the values of which can be 'precomputed' or 'euclidean'. What's the difference between the two? When I give 'euclidean', the MDS coordinates come out the same regardless of whether I use cosine_similarity or 1-cosine_similarity, which looks wrong.
Thanks!
I do not really understand your cosine transformation (as I see no cosine/angle/normalized scalar product being involved) and I do not know the TfidfVectorizer functionality but I will try to answer your two questions:
1) Generally, the (dissimilarity = 1 - similarity) approach is valid when all entries of the matrix lie between -1 and 1. Assuming the matrix d = cosine_similarity is such a symmetric matrix up to numerical artefacts, you can apply
to correct for the artefacts. The same operation can be needed when using NumPy's corrcoef(X) to create a dissimilarity matrix based on Pearson correlation coefficients. Two side notes: 1. For unbounded similarity measures you can still come up with equivalent approaches. 2. For MDS, you might consider a measure that is closer to Euclidean distance (and not bounded), as this would be a more natural choice for MDS and may lead to better results.
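One possible way to clean up the numerical artefacts, sketched with NumPy only. This is not necessarily the operation the answer had in mind; the function name similarity_to_dissimilarity and the toy matrix values are made up for illustration, and the sketch assumes the similarity matrix is square and symmetric up to round-off.

```python
import numpy as np

def similarity_to_dissimilarity(similarity):
    """Turn a cosine-similarity matrix into a clean dissimilarity matrix.

    Hypothetical helper: assumes a square matrix that is symmetric
    up to floating-point round-off.
    """
    d = 1.0 - np.asarray(similarity, dtype=float)
    d = (d + d.T) / 2.0          # enforce exact symmetry
    d = np.clip(d, 0.0, None)    # remove tiny negative values from round-off
    np.fill_diagonal(d, 0.0)     # self-dissimilarity is exactly zero
    return d

# Toy similarity matrix with round-off noise on the diagonal (made-up values).
sim = np.array([
    [1.0 - 1.12e-9, 0.30, 0.10],
    [0.30, 1.0 + 2.0e-10, 0.55],
    [0.10, 0.55, 1.0],
])
dissim = similarity_to_dissimilarity(sim)
```

The resulting dissim has an exactly zero diagonal, is exactly symmetric, and contains no negative entries, which is what a precomputed dissimilarity matrix for MDS is expected to look like.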
2) Using the 'precomputed' option assumes that you pass a dissimilarity matrix that you computed yourself to the .fit(X=dissimilarity matrix) method of MDS (your scenario). Using dissimilarity='euclidean' instead computes the Euclidean distance matrix of the data that you pass to .fit(X=data), treating each row as a feature vector.
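A minimal sketch of the two modes, assuming scikit-learn and NumPy are installed. The toy matrix is made up, and the dissimilarity parameter name follows the question; newer scikit-learn releases may emit deprecation warnings for it.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import euclidean_distances

# Toy cosine-similarity matrix for three documents (made-up values).
sim = np.array([
    [1.0, 0.30, 0.10],
    [0.30, 1.0, 0.55],
    [0.10, 0.55, 1.0],
])
dissim = 1.0 - sim
np.fill_diagonal(dissim, 0.0)

# 'precomputed': the matrix is used directly as pairwise dissimilarities.
mds_pre = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_pre = mds_pre.fit_transform(dissim)

# 'euclidean': rows are treated as feature vectors, and Euclidean distances
# between rows are computed first. Rows of (1 - sim) are the rows of sim
# flipped and shifted by a constant, so these distances are identical.
dist_from_sim = euclidean_distances(sim)
dist_from_dissim = euclidean_distances(1.0 - sim)
same_distances = np.allclose(dist_from_sim, dist_from_dissim)
```

This also suggests why 'euclidean' gave the same coordinates for cosine_similarity and 1-cosine_similarity in the question: for any two rows a and b, (1 - a) - (1 - b) = b - a, so the row-to-row Euclidean distances do not change. With a fixed random_state, MDS then sees the same distance matrix in both cases.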
Hope this helps!