I am using the Apache Spark-GraphFrames using Scala in the following Code, I am applying the BFS on above code and try to find the distance between Vertice 0 to 100.
import org.apache.spark._
import org.graphframes._
import org.graphframes.GraphFrame
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SQLContext
object SimpApp{
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SimpApp")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val nodesList = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("CSV File Path")
val edgesList= sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("CSV File Path")
val v=nodesList.toDF("id")
val e=edgesList.toDF("src", "dst", "dist")
val g = GraphFrame(v, e)
var paths: DataFrame = g.bfs.fromExpr("id = 0").toExpr(s"id = 100").maxPathLength(101).run()
paths.show()
sc.stop()
}
}
Soucre Node:0 Destination Node:100
Vertices List is Given below
id
0
1
2
3
.
.
.
up to
1000
Here is the Edges List
src dst dist
0 1 2
1, 2, 1
2, 3, 5
3, 4, 1
4, 5, 3
5, 6, 3
6, 7, 6
. . .
. . .
. . .
up to
999, 998, 4
But Problem with above-given code is, it taking a large amount of time just for execution for 0 to100 vertice, Since It was running 4 hours but there is no output. Above code I am running on Single Machine having 12 GB RAM.
Can you please Guide me to speed up and optimize the code.
To validate, I take it that you are trying to find the shortest distance for the unweighted edges of your graph hence the use of BFS. In that cases, you may want to remove the
maxPathLength(101)
from your query so its:As noted in the BFS definition:
By specifying 101 between vertex 0 and vertex 100, it will try to find any and all edges from 0 to 100 that have a length of 101 hence the large number of iterations.
A fun example of BFS and the shortest distance can be described in the classic graph scenario concerning flights (ref: On-Time Flight Performance with GraphFrames for Apache Spark) where the vertexes (or nodes) are airports while the edges are the flights between those airports.
If you're trying to find a non-stop flight between
SFO
(San Francisco) andBUF
(Buffalo), the BFS query would be:which as noted in the referenced link, there are no direct flights hence no results. But if you increase the
maxPathLength
to 2 (i.e. i.e. one additional node between theSFO
andBUF
nodes), then you would find a number of path (e.g.SFO
>BOS
>BUF
or San Francisco to Boston to Buffalo)