How to get a list of graph nodes after using connectedComponents in PySpark


I am learning PySpark. If I use the line of code below to get the connected components of my graph, a column named component is added to the resulting DataFrame, holding what looks like a random number. Is it possible to get the list of nodes that are connected to each other?

g.connectedComponents()

There is 1 answer

Alex Ott

The result of connectedComponents() is just a normal DataFrame, so you can group it by component and then collect the node IDs of each group into a list with the collect_list function. For example, using the example graph from graphframes:

from graphframes.examples import Graphs
import pyspark.sql.functions as F

# sc and sqlContext are assumed to be the ones predefined in the PySpark shell.
# connectedComponents() needs a checkpoint directory to be set first.
sc.setCheckpointDir("/tmp/spark-checkpoint")

g = Graphs(sqlContext).friends()
df = g.connectedComponents()

# getting the list of IDs per component
df2 = df.select("id", "component").groupBy("component") \
  .agg(F.collect_list("id"))
df2.show()

This will give:

+------------+------------------+
|   component|  collect_list(id)|
+------------+------------------+
|412316860416|[a, b, c, d, e, f]|
+------------+------------------+
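
If the result is small enough to bring to the driver, the same grouping can also be done in plain Python on the collected rows. The following is only a sketch: group_nodes is a hypothetical helper, not part of PySpark or graphframes, and the sample rows are hand-written to mimic the output above. With a real result you would pass in df.select("id", "component").collect() instead, which should work because each Row unpacks like a tuple.

```python
from collections import defaultdict


def group_nodes(rows):
    """Group (id, component) pairs into {component: [ids]}.

    Hypothetical helper for illustration; `rows` can be any iterable of
    2-tuples, such as the Row objects returned by
    df.select("id", "component").collect().
    """
    groups = defaultdict(list)
    for node_id, component in rows:
        groups[component].append(node_id)
    return dict(groups)


# Hand-written sample rows mimicking the friends graph result shown above.
rows = [
    ("a", 412316860416),
    ("b", 412316860416),
    ("c", 412316860416),
    ("d", 412316860416),
    ("e", 412316860416),
    ("f", 412316860416),
]

components = group_nodes(rows)
print(components)
# {412316860416: ['a', 'b', 'c', 'd', 'e', 'f']}
```

For large graphs the collect_list approach is preferable, since it keeps the grouping distributed instead of pulling every row onto the driver.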