Clustering on Graph (using Boost Graph Library)

2.1k views Asked by At

In a C++ project, we are trying on an important Boost Graph traffic-related to launch several simulations of Dijkstra for the shortest path between two nodes.

The graph has 18 000 vertices and 40 000 edges. Loading the graph takes roughly 200ms, and a Dijkstra run 50ms.

But time start to be an issue, so we want to lower those times. We are looking at several options :

  • Use heuristics of the Dijkstra algorithm like :

    • Bi-directional Dijkstra
    • A* search
  • Pre-processing the graph like clustering operations, to reduce the number of node and vertices loaded, and thus reducing the loading time :

    • Hierarchical clustering
    • Markov cluster algorithm

So the question is in two parts :

  • What is the best/easier method to implemented a graph clustering ? (if it is using the boost library, it would be easier to implement for us). With references, examples of code that could use ?

  • What is the best Dijsktra-like algorithm to use in this kind of scenario ?

If you have any information about those two inquiries, it will be much appreciated.

Thank you.

1

There are 1 answers

2
pbible On

Clustering

There are many ways to cluster nodes in a graph. Any distance metric for node representations can be used for clustering. Boost doesn't have out of the box clustering support other than in a few limited cases such as betweenness clustering.

The Micans package has a very simple and fast program for markov clustering. This can be easily called from the command line. Markov clustering (MCL) works by repeated matrix multiplication to simulate random walks. It might not be hard to implement if you change your graph to a matrix representation.

If your a graph is not changing then clustering could be performed offline. R is a much better environment for clustering because many methods are already implemented but you must provide distance values.

Do you need clustering?

Will clustering help your shortest path calculations? My intuition (often wrong) leads me to believe that clustering would result in only local optima. An optimal solution on a subset of your graph does not guarantee that it would be useful for the global solution. The traffic problem makes me think of flow algorithms such as boykov_kolmogorov_max_flow.

Other suggestions

I'm not sure of your set up. If you are running 1 simulation per process then the load time could be an issue. You could trying running more than one simulation per process so your load time isn't repeated as often.

For example, currently you have 200ms for load and 50ms for Dijkstra. If you run two back to back you have 500ms for two runs.

You could load the graph in 200ms, then run 6 simulations on the same process back to back. That way you have 200ms for load, then 50ms x 6 simulations. This gives you a 3x increase in number of simulations you can run even without parallel processing.