Spark - Combinations without repetition

1.8k views Asked by At

I am trying to do all lines combinations without repetition of a text file.

Example:

  1. 1
  2. 2
  3. 2
  4. 1
  5. 1

Result:

  • Line 1 with line 2 = (1,2)
  • Line 1 with line 3 = (1,2)
  • Line 1 with line 4 = (1,1)
  • Line 1 with line 5 = (1,1)
  • Line 2 with line 3 = (2,2)
  • Line 2 with line 4 = (2,1)
  • Line 2 with line 5 = (2,1)
  • Line 3 with line 4 = (2,1)
  • Line 3 with line 5 = (2,1)
  • Line 4 with line 5 = (1,1)

or

Considering (x,y), if (x != y) 0 else 1:

  • 0
  • 0
  • 1
  • 1
  • 1
  • 0
  • 0
  • 0
  • 0
  • 1

I have the following code:

def processCombinations(rdd: RDD[String]) = {
    rdd.mapPartitions({ partition => {
        var previous: String = null;
        if (partition.hasNext)
          previous = partition.next

        for (element <- partition) yield {
          if (previous == element)
            "1"
          else
            "0"
        }
      }
    })
  }

The piece of code above is doing the combinations of the first element of my RDD, in other words: (1,2) (1,2) (1,1) (1,1).

The problem is: This code ONLY works with ONE PARTITION. I'd like to make this work with many partitions, how could I do that?

2

There are 2 answers

4
The Archetypal Paul On

It's not very clear exactly what you want as output, but this reproduces your first example, and translates directly to Spark. It generates combinations, but only where the index of the first element in the original list is less than the index of the second, which is I think what you're asking for.

val r = List(1,2,2,1,1)
val z = r zipWithIndex

z.flatMap(x=>z.map(y=>(x,y))).collect{case(x,y) if x._2 < y._2 => (x._1, y._1)}
//List((1,2), (1,2), (1,1), (1,1), (2,2), (2,1), (2,1), (2,1), (2,1), (1,1))

or, as a for-comprehension

for (x<-z; y<-z; if x._2 < y._2) yield (x._1, y._1)
1
uranio On

This code calculate the combinations without repetitions by using recursion. It gets 2 arguments: number of elements for the combination and the list of elements.

It works in the following way: for the given list: 1, 2, 3, 4, 5 => It takes the 4 first elements for the first combination. Then It generates other combination with 5, the last element of the list. When there are not more elements left in the list, It moves one position back (third position) and takes the next element to generates more combinations from there: 1, 2, "4", 5. This operation is done recursively with all of elements of the list.

def combinator[A](n: Int, list: List[A], acc: List[A]): List[List[A]] = {
  if (n == 0)
    List(acc.reverse)
  else if (list == Nil)
    List()
  else
    combinator(n - 1, list.tail, list.head :: acc) ::: combinator(n, list.tail, acc)
}

combinator(4, List(1, 2, 3, 4, 5), List()).foreach(println)

// List(1, 2, 3, 4)
// List(1, 2, 3, 5)
// List(1, 2, 4, 5)
// List(1, 3, 4, 5)
// List(2, 3, 4, 5)