r00 = sc.parallelize(range(9))           # 0, 1, ..., 8
r01 = sc.parallelize(range(0, 90, 10))   # 0, 10, ..., 80
r10 = r00.cartesian(r01)                 # CartesianRDD
r11 = r00.map(lambda n: (n, n))          # pair RDD via a Python lambda
r12 = r00.zip(r01)                       # zip of the two collections
r13 = r01.keyBy(lambda x: x / 20)        # key each value by x / 20
r20 = r11.union(r12).union(r13).union(r10)
r20.collect()
The previous PySpark code gives the following job DAG:

But the stage DAG of the job shows several `PythonRDD` nodes derived from the same `ParallelCollectionRDD`, even though they come from the same source; for instance, `ParallelCollectionRDD [0]` leads to `PythonRDD [2]`, `PythonRDD [5]`, and `PythonRDD [8]`.

Why are these `PythonRDD` nodes present? Why is there no direct edge from `ParallelCollectionRDD` to `UnionRDD`, `ZippedPartitionsRDD`, and `CartesianRDD`?