I am using Python 2.7. I have routes which are composed of arrays of nodes that connect to each other. The nodes are identified by a string key, but for ease I will use numbers:
sample_route = [1,2,3,4,7]
#obviously over-simplified; real things would be about 20-40 elements long
I will create a set
made up of tuple pairs of point connections using zip, which will end up like:
set([(1,2),(2,3),(3,4),(4,7)])
I will need a way to filter out some routes that are very similar (like one or two added nodes) and use the minimal route out of those near-duplicates. My plan right now is to:
Start with the first (probably optimal) route. Iterate through the rest of the routes, and use the following formula to calculate its similarity to the sequence in step 1:
matching = len(value1.difference(value2)) + len(value2.difference(value1))
#value1, value2 = two compared sets
The lower the number, the more similar. But what is a good way to group the routes based on their similarity to other routes? They will all be different lengths. I have never taken a statistics course.
Example:
sets = [
set([(1,2),(2,3),(3,4),(4,5),(5,10)]),
set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),
set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
set([(1,9),(9,4),(4,5),(5,10)]),
set([(1,9),(9,4),(4,6),(6,5),(5,10)])
]
In this example, the groupings may be something like [[1,2,4],[3],[5,6]]
, where 1, 2, and 4 are very similar, 5 and 6 are similar, and 3 is nowhere near any of the others. 1 to 2 would have a score of 2, and 3 to 6 would have a score of 8, as examples. This is the sort of data I am using (though these are easy to read simplifications).
There is a time benefit in this. If I can remove redundant routes, I will trim off a considerable amount of time.
I would recommend looking into the networkx package. It allows you to create directed graphs such as you are describing. For measuring the similarity of 2 routes, I would recommend the Jaccard similarity index. Here is code showing the example you illustrated.
First, import a few libraries: graphs, plotting, and numeric python. Then build the directed graph by adding nodes numbered 1 to 8. Build the connections from node to node to build your path. The networkx package has built-in capability to find all paths in the graph from one node to another:
nx.all_simple_paths(g, start_node, end_node)
.Once you have all of the paths, you can compute the a matrix,
J
, of the Jaccard similarity between the paths. How you actually want to cluster the paths by their similarity is up to you.EDIT Since you have a metric to compare the similarity of the paths to each other, look into k-means clustering to cluster the paths together.
from scipy.cluster.vq import kmeans2
I don't have enough of your code or your data to help anymore from this point.