I am trying to generate all pairwise combinations, without repetition, of the lines of a text file.
Example:
- 1
- 2
- 2
- 1
- 1
Result:
- Line 1 with line 2 = (1,2)
- Line 1 with line 3 = (1,2)
- Line 1 with line 4 = (1,1)
- Line 1 with line 5 = (1,1)
- Line 2 with line 3 = (2,2)
- Line 2 with line 4 = (2,1)
- Line 2 with line 5 = (2,1)
- Line 3 with line 4 = (2,1)
- Line 3 with line 5 = (2,1)
- Line 4 with line 5 = (1,1)
or, mapping each pair (x,y) to 0 if x != y and 1 otherwise:
- 0
- 0
- 1
- 1
- 1
- 0
- 0
- 0
- 0
- 1
I have the following code:
def processCombinations(rdd: RDD[String]) = {
  rdd.mapPartitions({ partition => {
    var previous: String = null
    if (partition.hasNext)
      previous = partition.next
    for (element <- partition) yield {
      if (previous == element)
        "1"
      else
        "0"
    }
  }
  })
}
The code above only produces the combinations that involve the first element of my RDD, in other words: (1,2) (1,2) (1,1) (1,1).
The problem is that this code only works with a single partition. How can I make it work with multiple partitions?
It's not entirely clear what you want as output, but this reproduces your first example and translates directly to Spark. It generates combinations, but only those where the index of the first element in the original list is less than the index of the second, which I think is what you're asking for.
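A minimal sketch of that idea on a plain Scala List, using the sample data from the question. The names `lines`, `indexed`, `pairs` and `flags` are illustrative assumptions, not taken from the original post:

```scala
// Sample input from the question (one entry per line of the file).
val lines = List("1", "2", "2", "1", "1")

// Attach each line's position so pairs can be ordered by index.
val indexed = lines.zipWithIndex

// Cross the list with itself, keep only pairs whose first index is
// lower than the second, then drop the indices again.
val pairs = indexed
  .flatMap(a => indexed.map(b => (a, b)))
  .filter { case ((_, i), (_, j)) => i < j }
  .map { case ((x, _), (y, _)) => (x, y) }

// Second form from the question: 1 if both lines are equal, 0 otherwise.
val flags = pairs.map { case (x, y) => if (x == y) 1 else 0 }
```

On an RDD the same steps would presumably be expressed with `zipWithIndex` and `cartesian` (both exist on `RDD`), followed by the same `filter` and `map`. Because the index comparison does not depend on how the data is partitioned, this should not be limited to a single partition, although I have not measured how the cartesian product scales on large inputs.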
or, as a for-comprehension
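A sketch of the for-comprehension form, again on a plain List with the question's sample data. The names `lines`, `indexed` and `pairsFor` are illustrative assumptions:

```scala
// Sample input from the question, paired with each line's position.
val lines = List("1", "2", "2", "1", "1")
val indexed = lines.zipWithIndex

// Two generators give the cross product; the guard keeps only pairs
// where the first line comes before the second in the original list.
val pairsFor = for {
  (x, i) <- indexed
  (y, j) <- indexed
  if i < j
} yield (x, y)
```

This yields the ten pairs in the same order as the "Result" list in the question, starting with (1,2) for line 1 with line 2 and ending with (1,1) for line 4 with line 5.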