Spark grouping and then sorting (Java code)

I have a JavaPairRDD and need to group by the key and then sort it using a value inside the object MyObject.

Let's say MyObject is:

class MyObject {
    Integer order;
    String name;
}

Sample data:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}
2, {order:1, name:'Alfred'}
2, {order:3, name:'Ana'}
2, {order:2, name:'Jessica'}
3, {order:3, name:'Will'}
3, {order:2, name:'Mariah'}
3, {order:1, name:'Monika'}

Expected result:

Partition 1:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}

Partition 2:

2, {order:1, name:'Alfred'}
2, {order:2, name:'Jessica'}
2, {order:3, name:'Ana'}

Partition 3:

3, {order:1, name:'Monika'}
3, {order:2, name:'Mariah'}
3, {order:3, name:'Will'}

I'm using the key to partition the RDD and then using MyObject.order to sort the data inside the partition.

My goal is to take only the first k elements of each sorted partition and then reduce them to a value calculated from another MyObject attribute (i.e. "the first N best of the group").

How can I do this?

There is 1 answer

Answer by T. Gawęda (best answer)

You can use mapPartitionsToPair:

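// rdd is the JavaPairRDD<Long, MyObject>; numPartitions is a value you choose.
JavaPairRDD<Long, MyObject> sortedRDD = rdd
    // partitionBy keeps the (key, MyObject) pairs; groupBy would wrap them in an Iterable
    .partitionBy(new HashPartitioner(numPartitions))
    .mapPartitionsToPair(it -> {
        List<Tuple2<Long, MyObject>> values = toArrayList(it);
        // A partition can hold several keys, so sort by key first, then by order
        values.sort(Comparator
            .comparing((Tuple2<Long, MyObject> t) -> t._1())
            .thenComparing(t -> t._2().order));

        return values.iterator();
    }, true);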

Two highlights:

  • toArrayList takes an Iterator and returns an ArrayList. You must implement it yourself.
  • It is important to pass true as the second argument of mapPartitionsToPair, because that tells Spark to preserve the existing partitioning.
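
Since toArrayList is left to the reader, here is a minimal sketch of how it could look in plain Java, with no Spark dependency. The class name PartitionSortSketch, the MyObject constructor and the firstKByOrder helper are assumptions made for illustration; they are not part of the answer or of the Spark API. firstKByOrder only shows the "first k after sorting" step that the question asks about, applied to one group's elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class PartitionSortSketch {

    // Stand-in for the MyObject class from the question (constructor added for the demo).
    static class MyObject {
        final Integer order;
        final String name;

        MyObject(Integer order, String name) {
            this.order = order;
            this.name = name;
        }
    }

    // Drains the iterator into a new ArrayList, preserving iteration order.
    static <T> ArrayList<T> toArrayList(Iterator<T> it) {
        ArrayList<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Hypothetical helper: sorts by order and keeps at most the first k elements.
    // Assumes k >= 0.
    static List<MyObject> firstKByOrder(Iterator<MyObject> it, int k) {
        List<MyObject> values = toArrayList(it);
        values.sort(Comparator.comparing((MyObject o) -> o.order));
        return new ArrayList<>(values.subList(0, Math.min(k, values.size())));
    }

    public static void main(String[] args) {
        // The elements of key 2 from the sample data, in their original order.
        Iterator<MyObject> group2 = Arrays.asList(
                new MyObject(1, "Alfred"),
                new MyObject(3, "Ana"),
                new MyObject(2, "Jessica")).iterator();

        // Prints "1 Alfred" and then "2 Jessica".
        for (MyObject o : firstKByOrder(group2, 2)) {
            System.out.println(o.order + " " + o.name);
        }
    }
}
```

Note that materializing a whole partition in a list only works if the partition fits in executor memory, so this approach suits moderately sized groups.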