Duplicate RDDs when union of RDD from same sources

r00 = sc.parallelize(range(9))
r01 = sc.parallelize(range(0,90,10))
r10 = r00.cartesian(r01)
r11 = r00.map(lambda n : (n, n))
r12 = r00.zip(r01)
r13 = r01.keyBy(lambda x : x / 20)
r20 = r11.union(r12).union(r13).union(r10)

The PySpark code block above produces the following job DAG:


But the stage DAG of the job shows several PythonRDDs derived from each ParallelCollectionRDD, even though they come from the same source (for instance, ParallelCollectionRDD [0] has PythonRDD [2], PythonRDD [5] and PythonRDD [8]).

Stage DAG

Why are the PythonRDDs present? Why is there no direct connection from the ParallelCollectionRDD to the UnionRDD, ZippedPartitionRDD and CartesianRDD?
