How to group sets by similarity in contained elements

1.4k views Asked by At

I am using Python 2.7. I have routes which are composed of arrays of nodes that connect to each other. The nodes are identified by a string key, but for ease I will use numbers:

sample_route = [1,2,3,4,7] 
#obviously over-simplified; real things would be about 20-40 elements long

I will create a set made up of tuple pairs of point connections using zip, which will end up like:

set([(1,2),(2,3),(3,4),(4,7)])

I will need a way to filter out some routes that are very similar (like one or two added nodes) and use the minimal route out of those near-duplicates. My plan right now is to:

Start with the first (probably optimal) route. Iterate through the rest of the routes, and use the following formula to calculate its similarity to the sequence in step 1:

matching = len(value1.difference(value2)) + len(value2.difference(value1))
#value1, value2 = two compared sets

The lower the number, the more similar. But what is a good way to group the routes based on their similarity to other routes? They will all be different lengths. I have never taken a statistics course.

Example:

sets = [
  set([(1,2),(2,3),(3,4),(4,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,9),(9,4),(4,5),(5,10)]),
  set([(1,9),(9,4),(4,6),(6,5),(5,10)])
]

In this example, the groupings may be something like [[1,2,4],[3],[5,6]], where 1, 2, and 4 are very similar, 5 and 6 are similar, and 3 is nowhere near any of the others. 1 to 2 would have a score of 2, and 3 to 6 would have a score of 8, as examples. This is the sort of data I am using (though these are easy to read simplifications).

There is a time benefit in this. If I can remove redundant routes, I will trim off a considerable amount of time.

1

There are 1 answers

2
James On BEST ANSWER

I would recommend looking into the networkx package. It allows you to create directed graphs such as you are describing. For measuring the similarity of 2 routes, I would recommend the Jaccard similarity index. Here is code showing the example you illustrated.

First, import a few libraries: graphs, plotting, and numeric python. Then build the directed graph by adding nodes numbered 1 to 8. Build the connections from node to node to build your path. The networkx package has built-in capability to find all paths in the graph from one node to another: nx.all_simple_paths(g, start_node, end_node).

Once you have all of the paths, you can compute the a matrix, J, of the Jaccard similarity between the paths. How you actually want to cluster the paths by their similarity is up to you.

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

g = nx.DiGraph()
g.add_nodes_from(range(1,8))
g.add_edges_from([(1,2),(2,3),(3,4),(4,7)]) #path 1,2,3,4,7
g.add_edges_from([(4,5),(5,7)])             #path 1,2,3,4,5,7
g.add_edges_from([(4,6),(6,7)])             #path 1,2,3,4,6,7
paths_iter = nx.all_simple_paths(g,1,7)
paths = [p for p in paths]

np.random.seed(100000)
nx.draw_spring(g, with_labels=True)
plt.show()

def jaccard(v1, v2):
    return (len(np.intersect1d(v1,v2))+0.0)/len(np.union1d(v1,v2))

J = np.zeros([len(paths),len(paths)])
for i in range(J.shape[0]):
    for j in range(i, J.shape[1]):
        J[i,j] = J[j,i] = jaccard(paths[i],paths[j])

print J

> [[ 1.          0.71428571  0.83333333]
>  [ 0.71428571  1.          0.83333333]
>  [ 0.83333333  0.83333333  1.        ]]

graph paths

EDIT Since you have a metric to compare the similarity of the paths to each other, look into k-means clustering to cluster the paths together.

from scipy.cluster.vq import kmeans2

I don't have enough of your code or your data to help anymore from this point.