I am learning the Spark source code and am confused by the following code:
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
What is the input data for the map(x => (x, null)) call? Why and when can the input be omitted?
UPDATE:
Here is the link to the source code.
distinct and map are both methods on the RDD class (source), so distinct is just calling another method on the same RDD. Because no receiver is written before map, the call is shorthand for this.map(...), and the input is the RDD that distinct was called on.

The map function is a higher-order function, i.e. it accepts a function as one of its parameters (f: T => U).

In the case of distinct, the parameter f to map is the anonymous function x => (x, null).

Here's a simple example of using an anonymous function (lambda), in the Scala REPL (using the similar map function on a Scala list, not a Spark RDD):
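A minimal sketch of both points, using only plain Scala collections. The names xs, paired, result and the MiniRDD class are hypothetical, introduced here for illustration; MiniRDD is not Spark's RDD, and it uses List.distinct as a stand-in for the reduceByKey step.

```scala
// Plain Scala collections only; no Spark dependency is needed for this sketch.
val xs = List(1, 2, 2, 3)

// map takes a function as its argument; here it is the lambda x => (x, null).
val paired = xs.map(x => (x, null))
// paired == List((1, null), (2, null), (2, null), (3, null))

// MiniRDD is a hypothetical stand-in for RDD, only to show the implicit receiver.
class MiniRDD[T](val items: List[T]) {
  def map[U](f: T => U): MiniRDD[U] = new MiniRDD(items.map(f))

  // No receiver is written before map, so this call is this.map(...):
  // its input is the elements of the MiniRDD that distinct was called on.
  // List.distinct stands in for Spark's reduceByKey step (an assumption of this sketch).
  def distinct: MiniRDD[T] =
    new MiniRDD(map(x => (x, null)).items.map(_._1).distinct)
}

val result = new MiniRDD(xs).distinct.items
// result == List(1, 2, 3)
```

On the list, the input to map is written explicitly (xs.map). Inside MiniRDD.distinct it is implicit, because a method call with no receiver inside a class is resolved against this.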