How to flatMap a function on GroupedDataSet in Apache Flink

I want to apply a function via flatMap to each group produced by DataSet.groupBy. When I call flatMap on the grouped data set, I get this compiler error:

error: value flatMap is not a member of org.apache.flink.api.scala.GroupedDataSet

My code:

var mapped = env.fromCollection(Array[(Int, Int)]())
var groups = mapped.groupBy("myGroupField")
groups.flatMap( myFunction: (Int, Array[Int]) => Array[(Int, Array[(Int, Int)])] )  // error: GroupedDataSet has no member flatMap

Indeed, the documentation of flink-scala 0.9-SNAPSHOT lists no map or similar method for GroupedDataSet. Is there an equivalent method I can use? How can I run a distributed mapping over each group individually on a node?

Answer by Fabian Hueske (best answer)

You can use reduceGroup(GroupReduceFunction f) to process all elements of a group. A GroupReduceFunction gives you an Iterable over all elements of a group and a Collector to emit an arbitrary number of elements.
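As a rough sketch of what such a function could look like in Scala: the class name TagWithGroupSize and the output shape (key, value, group size) are made up for illustration and are not part of the question. The sketch assumes the Flink Scala DataSet API with the GroupReduceFunction interface from org.apache.flink.api.common.functions. It emits one output record per input element, so a single call produces as many records as the group has elements.

```scala
import org.apache.flink.api.common.functions.GroupReduceFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

// Hypothetical example: tags every element with the size of its group.
// reduce() is invoked once per group and may call out.collect() any number of times.
class TagWithGroupSize extends GroupReduceFunction[(Int, Int), (Int, Int, Int)] {
  override def reduce(values: java.lang.Iterable[(Int, Int)],
                      out: Collector[(Int, Int, Int)]): Unit = {
    // The Iterable can only be traversed once, so buffer the group first.
    val buffered = values.asScala.toVector
    buffered.foreach { case (key, value) =>
      out.collect((key, value, buffered.size))
    }
  }
}
```

Assuming a DataSet[(Int, Int)] named pairs and import org.apache.flink.api.scala._ in scope, it would be applied as pairs.groupBy(0).reduceGroup(new TagWithGroupSize).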

Flink's groupBy() function does not group multiple elements into a single element, i.e., it does not convert a group of (Int, Int) elements (that all share the same _1 tuple field) into one (Int, Array[Int]). Instead, a DataSet[(Int, Int)] is logically grouped such that all elements that have the same key can be processed together. When you apply a GroupReduceFunction on a GroupedDataSet, the function will be called once for each group. In each call all elements of a group are handed together to the function. The function can then process all elements of the group and also convert a group of (Int, Int) elements into a single (Int, Array[Int]) element.
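The conversion described above might be sketched as follows, using the closure variant of reduceGroup from the Scala API, which passes a Scala Iterator instead of a Java Iterable. The object name, the helper toGroupRecord, and the sample data are assumptions for illustration. Grouping is done by tuple position 0 rather than by the field name used in the question. Whether print() triggers execution on its own depends on the Flink version; older releases may need an explicit env.execute().

```scala
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object GroupToArraySketch {

  // Hypothetical helper: collapses one group of (key, value) pairs
  // into a single (key, Array(values)) record.
  def toGroupRecord(elements: Iterator[(Int, Int)]): (Int, Array[Int]) = {
    val buffered = elements.toArray // the iterator can only be consumed once
    (buffered.head._1, buffered.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Sample input; the real job would read its own data.
    val pairs: DataSet[(Int, Int)] = env.fromElements((1, 10), (1, 11), (2, 20))

    // groupBy(0) only groups logically; reduceGroup is called once per key.
    val groups: DataSet[(Int, Array[Int])] = pairs.groupBy(0).reduceGroup {
      (elements: Iterator[(Int, Int)], out: Collector[(Int, Array[Int])]) =>
        // out.collect could be called several times here for flatMap-like output.
        out.collect(toGroupRecord(elements))
    }

    // Render the arrays as text so the printed records are readable.
    groups.map(g => s"${g._1} -> ${g._2.mkString(",")}").print()
  }
}
```

Buffering a whole group in memory, as toGroupRecord does, is only reasonable when individual groups are small enough to fit on one worker.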