I want to apply a function via flatMap
to each group produced by DataSet.groupBy
. Trying to call flatMap
I get the compiler error:
error: value flatMap is not a member of org.apache.flink.api.scala.GroupedDataSet
My code:
var mapped = env.fromCollection(Array[(Int, Int)]())
var groups = mapped.groupBy("myGroupField")
groups.flatMap( myFunction: (Int, Array[Int]) => Array[(Int, Array[(Int, Int)])] ) // error: GroupedDataSet has no member flatMap
Indeed, in the documentation of flink-scala 0.9-SNAPSHOT no map
or similar is listed. Is there a similar method to work with? How to achieve the desired distributed mapping over each group individually on a node?
You can use
reduceGroup(GroupReduceFunction f)
to process all elements a group. AGroupReduceFunction
gives you anIterable
over all elements of a group and anCollector
to emit an arbitrary number of elements.Flink's
groupBy()
function does not group multiple elements into a single element, i.e., it does not convert a group of(Int, Int)
elements (that all share the same_1
tuple field) into one(Int, Array[Int])
. Instead, aDataSet[(Int, Int)]
is logically grouped such that all elements that have the same key can be processed together. When you apply aGroupReduceFunction
on aGroupedDataSet
, the function will be called once for each group. In each call all elements of a group are handed together to the function. The function can then process all elements of the group and also convert a group of(Int, Int)
elements into a single(Int, Array[Int])
element.