Partitioning with Spark Graphframes

I'm working with a largish graph (60 million vertices and 9.5 billion edges) using Spark Graphframes. The underlying data is not large: the vertices take about 500 MB on disk and the edges about 40 GB. My containers are frequently shutting down due to Java heap out-of-memory errors, but I think the underlying problem is that the Graphframe is constantly shuffling data around (I'm seeing shuffle read/write of up to 150 GB). Is there a way to efficiently partition a Graphframe, or the underlying edges/vertices, to reduce shuffle?

There are 2 answers.
Here's a partial solution / workaround: create a UDF that mimics one of the GraphX partitioning strategies to derive a partition column, then repartition on that.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

num_parts = 256
# Mimic GraphX's RandomVertexCut: hash the (src, dst) pair into one of num_parts buckets
random_vertex_cut = udf(lambda src, dst: abs(hash((src, dst))) % num_parts, IntegerType())
edges = edges.withColumn("v_cut", random_vertex_cut(col("src"), col("dst"))).repartition(num_parts, "v_cut")

This approach can help somewhat, but not as well as GraphX's built-in partitioning.
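If the Python UDF itself becomes a bottleneck, a variant worth trying (my suggestion, not part of the original workaround) is Spark's built-in hash function, which computes the bucket natively in the JVM and avoids Python serialization overhead:

from pyspark.sql.functions import abs as sql_abs, col, hash as sql_hash

num_parts = 256
# Same vertex-cut idea: hash(src, dst) is Spark's native murmur3 hash of the two columns.
edges = edges.withColumn("v_cut", sql_abs(sql_hash(col("src"), col("dst"))) % num_parts) \
             .repartition(num_parts, "v_cut")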
TL;DR It is not possible to efficiently partition a Graphframe.

Graphframe algorithms can be separated into two categories:

- Methods which delegate processing to the GraphX counterpart. GraphX supports a number of partitioning methods, but these are not exposed via the Graphframe API. If you use one of these, it is probably better to use GraphX directly.

  Unfortunately, GraphX development has stopped almost completely, with only a handful of small fixes over the last two years, and its overall performance is highly disappointing compared to both in-core and out-of-core libraries.

- Methods which are implemented natively using Spark Datasets. Considering the limited programming model and the single partitioning mode, these are deeply unfit for complex graph processing.

  While relational columnar storage can be used for efficient graph processing, the naive iterative join approach employed by Graphframes just doesn't scale (though it is OK for shallow traversals of one or two hops; see the sketch after this list).
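For concreteness, a shallow traversal that the join-based approach handles acceptably might look like this (a hypothetical motif query; the graph g and the filter value are my own illustration, not from the original answer):

from pyspark.sql.functions import col

# One-hop neighborhood via a motif query: a single join under the hood.
# g is assumed to be an existing GraphFrame; 42 is a made-up vertex id.
one_hop = g.find("(a)-[e]->(b)").filter(col("a.id") == 42)

# Two hops already means chaining another join; beyond that, costs grow quickly.
two_hop = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")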
You can try to repartition the vertices and edges DataFrames by id and src respectively, as in the sketch below, which should help in some cases.
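A minimal sketch of that repartitioning (the variable names and partition count are mine; tune n_part to your cluster):

from graphframes import GraphFrame

n_part = 256
# Partition vertices by "id" and edges by "src" before building the graph,
# which can let Spark reuse the partitioning in subsequent joins on those keys.
g = GraphFrame(
    vertices.repartition(n_part, "id"),
    edges.repartition(n_part, "src"),
)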
Overall, in its current (Dec 2016) state, Spark is not a good choice for intensive graph analytics.