I have a dataframe with two columns, vertex and weight:

----------------
vertex| weight
----------------
a     | w1
b     | w2
..    | ...
x     | wx
----------------

I want to compute a similarity between every pair of vertices. In other words, I'm looking for a new dataframe:

    --------------------------
    vertex1 | vertex2 | weight
    --------------------------
    a       | b       | w1+w2
    a       | c       | w1+w3
    ..      | ..      | ...
    a       | x       | w1+wx
    b       | a       | w2+w1
    b       | c       | w2+w3
    ..      | ..      | ...
    --------------------------

Any suggestions on how to do that?

1 Answer

OmG On Best Solutions

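A simple solution is to join the dataframe with itself on the condition that the two vertices are different. Since this is a non-equi join, it produces n(n-1) rows for n vertices, so it only scales to moderately sized inputs. A naive implementation could look like the following: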

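from pyspark.sql.functions import col

# Two renamed copies of df, so the join columns are unambiguous
df1 = df.select(col("vertex").alias("vertex1"), col("weight").alias("weight1"))
df2 = df.select(col("vertex").alias("vertex2"), col("weight").alias("weight2"))
# Keep every ordered pair of distinct vertices and sum their weights
result = df1.join(df2, col('vertex1') != col('vertex2'))\
            .withColumn('weight', df1['weight1'] + df2['weight2'])\
            .select(col('vertex1'), col('vertex2'), col('weight'))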
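
To sanity-check what the self-join should return, the same pairing can be sketched in plain Python on a tiny input. This is only an illustration, not a replacement for the Spark code: the sample rows and the helper name pair_weights are made up, and it assumes each vertex appears exactly once.

```python
from itertools import permutations

# Hypothetical sample data standing in for the (vertex, weight) rows
rows = [("a", 1.0), ("b", 2.0), ("c", 4.0)]

def pair_weights(rows):
    """Return (vertex1, vertex2, weight1 + weight2) for every ordered pair of distinct rows."""
    # permutations(rows, 2) yields both (a, b) and (b, a), like the != join condition
    return [(v1, v2, w1 + w2) for (v1, w1), (v2, w2) in permutations(rows, 2)]

pairs = pair_weights(rows)
for p in pairs:
    print(p)
```

With 3 vertices this gives 3 * 2 = 6 rows, and both (a, b) and (b, a) appear, which matches the output requested in the question.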