Apache Spark GraphFrames is very slow on BFS


I am using Apache Spark GraphFrames with Scala in the following code. I apply BFS to the graph and try to find the distance between vertex 0 and vertex 100.

import org.apache.spark._
import org.graphframes._
import org.graphframes.GraphFrame
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SQLContext

object SimpApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the vertex and edge lists from CSV (the paths are placeholders)
    val nodesList = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true").load("CSV File Path")
    val edgesList = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true").load("CSV File Path")

    val v = nodesList.toDF("id")
    val e = edgesList.toDF("src", "dst", "dist")
    val g = GraphFrame(v, e)

    // BFS from vertex 0 to vertex 100, allowing paths of up to 101 hops
    val paths: DataFrame = g.bfs.fromExpr("id = 0").toExpr("id = 100").maxPathLength(101).run()
    paths.show()

    sc.stop()
  }
}

Source node: 0, destination node: 100

The vertices list is given below:

id
0
1
2
3
.
.
.
up to
1000

Here is the edges list:

src  dst  dist
0    1    2
1    2    1
2    3    5
3    4    1
4    5    3
5    6    3
6    7    6
.    .    .
.    .    .
.    .    .
up to
999  998  4

The problem with the above code is that it takes a very long time just to go from vertex 0 to vertex 100: it had been running for 4 hours with no output. I am running this code on a single machine with 12 GB of RAM.

Can you please guide me on how to speed up and optimize this code?

1 Answer

Answered by Denny Lee (accepted answer):

To validate: I take it that you are trying to find the shortest distance over the unweighted edges of your graph, hence the use of BFS. In that case, you may want to remove the maxPathLength(101) from your query so it is:

g.bfs.fromExpr("id = 0").toExpr("id = 100").run() 
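Applied to the program in the question, that is the only line that changes. The sketch below keeps the asker's SQLContext-based CSV loading and leaves the file paths as placeholders, so it is an illustration rather than a drop-in file:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.graphframes.GraphFrame

object SimpApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Same CSV loading as in the question; the paths are placeholders.
    val v = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("CSV File Path").toDF("id")
    val e = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("CSV File Path").toDF("src", "dst", "dist")

    val g = GraphFrame(v, e)

    // No explicit maxPathLength: the search is no longer forced to consider paths of up to 101 hops.
    val paths: DataFrame = g.bfs.fromExpr("id = 0").toExpr("id = 100").run()
    paths.show()

    sc.stop()
  }
}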

As noted in the BFS definition:

maxPathLength is the limit on the length of paths with a default of 10. If no valid paths of length <= maxPathLength are found, then the BFS is terminated.

By specifying 101 between vertex 0 and vertex 100, BFS will try to find any and all paths from 0 to 100 with a length of up to 101 edges, hence the large number of join iterations.
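To make the cutoff concrete, here is a minimal sketch on a hypothetical 5-vertex chain (the chain and the local SparkSession setup are assumptions for illustration, not part of the question's data): with a bound of 2 the endpoints are not connected within the limit, while a bound of 4 returns the path.

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object MaxPathLengthDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MaxPathLengthDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical 5-vertex chain: 0 - 1 - 2 - 3 - 4
    val v = (0 to 4).toDF("id")
    val e = (0 to 3).map(i => (i, i + 1)).toDF("src", "dst")
    val g = GraphFrame(v, e)

    // The only path from 0 to 4 uses 4 edges, so a bound of 2 returns no rows...
    g.bfs.fromExpr("id = 0").toExpr("id = 4").maxPathLength(2).run().show()

    // ...while a bound of 4 (or more) returns the path, at the cost of more join iterations.
    g.bfs.fromExpr("id = 0").toExpr("id = 4").maxPathLength(4).run().show()

    spark.stop()
  }
}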

A fun example of BFS and shortest distance is the classic graph scenario of flights (ref: On-Time Flight Performance with GraphFrames for Apache Spark), where the vertices (or nodes) are airports and the edges are the flights between those airports.

If you're trying to find a non-stop flight between SFO (San Francisco) and BUF (Buffalo), the BFS query would be:

tripGraph.bfs.fromExpr("id = 'SFO'").toExpr("id = 'BUF'").maxPathLength(1).run()

which, as noted in the referenced link, returns no results because there are no direct flights. But if you increase maxPathLength to 2 (i.e. one additional node between the SFO and BUF nodes), then you would find a number of paths (e.g. SFO > BOS > BUF, or San Francisco to Boston to Buffalo):

tripGraph.bfs.fromExpr("id = 'SFO'").toExpr("id = 'BUF'").maxPathLength(2).run()
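tripGraph itself comes from the referenced notebook; a minimal toy stand-in (the three airports and two flights below are made up for illustration) shows the same behavior:

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object TripGraphDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TripGraphDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the flight data: three airports, no direct SFO -> BUF flight.
    val airports = Seq("SFO", "BOS", "BUF").toDF("id")
    val flights  = Seq(("SFO", "BOS"), ("BOS", "BUF")).toDF("src", "dst")
    val tripGraph = GraphFrame(airports, flights)

    // maxPathLength(1): only non-stop flights are considered, so nothing is returned.
    tripGraph.bfs.fromExpr("id = 'SFO'").toExpr("id = 'BUF'").maxPathLength(1).run().show()

    // maxPathLength(2): one intermediate stop is allowed, so SFO > BOS > BUF is found.
    tripGraph.bfs.fromExpr("id = 'SFO'").toExpr("id = 'BUF'").maxPathLength(2).run().show()

    spark.stop()
  }
}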