I'm using a similarity matrix with values between 0 and 1 (1 means that the elements are equals), and I'm trying to plot a MDS with python and scikit-learn.
I found multiple examples, but I'm not sure about what to give as an input to mds.fit().
For now, my data looks like that (file.csv) :
; A ; B ; C ; D ; E
A ; 1 ; 0.1 ; 0.2 ; 0.5 ; 0.2
B ; 0.1 ; 1 ; 0.3 ; 1 ; 0
C ; 0.2 ; 0.3 ; 1 ; 0.8 ; 0.6
D ; 0.5 ; 1 ; 0.8 ; 1 ; 0.2
E ; 0.2 ; 0 ; 0.6 ; 0.2 ; 1
I'm currently using this code :
import pandas
from sklearn import manifold
import matplotlib.pyplot as plt
data = pandas.read_table("file.csv", ";", header=0, index_col=0)
mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_
# Prepare axes
ax = plt.axes([0,0,2,2])
ax.set_aspect(aspect='equal')
# Plot points
plt.scatter(points[:,0], points[:,1], color='silver', s=150)
# Add labels
for i in range(data.shape[0]):
ax.annotate(data.index[i], (points[i,0], points[i,1]), color='blue')
#plt.show() # Open display and show at screen
plt.savefig('out.png', format='png', bbox_inches='tight') # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight') # JPG
I'm unsure about what sklearn is doing. I read a lot of examples where people are using "dissimilarity matrix" with 0 in the middles (instead of 1).
Should I make a transformation ? Or not ? If yes, which transformation should be done ? (I read there that a simple substraction is enough... but other methods exist... I'm a bit lost :( )
Does sklearn and MDS automatically understand the input ? (as a similarity or dissimilarity matrix with the 0 or 1 in the middle ?) Or does it use a distance matrix ? (In this case, how to obtain it from a similarity matrix ?)
In this link, they say the similarity are between 1 and -1... I'm using similarities between 0 and 1... I suppose I should transform my data? Which transformation should be used?
I did a comparison with XLSTAT (an excel extension) in order to try a lot of scenarios and compare how to do what.
First : my input matrix was a "similarity" matrix, because I could interpret it as: "A and A are 100% equal". As MDS is taking a dissimilarity matrix as input, I must apply a transformation.
similarity-to-dissimilarity-simple.awk
It can be called using the command :
This method seems to be perfectly adapted if the diagonal does not contain the same value (I saw there a co-occurrence matrix... it should apply to his cas). In my case, as the diagonal is ALWAYS full of 1, I reduced it to :
The AWK program to make this transformation (I implemented the simplified one, because of my data) is therefore :
similarity-to-dissimilarity-complex.awk
And you can call it with this command :
When I used the Kruskal's stress to check which version was better... in my case, the simple similarity to dissimilarity (1 - cell) was the best (I kept a stress between 0,34 and 0,32... which is not good... where the complex shows bigger values than 0,34, which is worse).