I have a streaming job consuming from Kafka (using createDstream).
It is a stream of ids:
[id1, id2, id3 ...]
I have a utility or an API which accepts an array of ids, makes some external call, and receives back some info, say "t", for each id:
[id1:t1, id2:t2, id3:t3 ...]
I want to retain the DStream while calling the utility. I can't use a map transformation on the DStream's RDDs, as that would make one call per id; moreover, the utility accepts a collection of ids.
dstream.map(x => myUtility(x)) // ruled out
And if I use
dstream.foreachRDD(rdd => myUtility(rdd.collect()))
I lose the DStream. I need to retain the DStream for downstream processing.
The way to make bulk external calls while retaining the DStream is to transform the RDDs in the DStream directly at the partition level, using transform together with mapPartitions.
A sketch of the pattern, assuming a bulk utility with a signature like myUtility(ids: Array[String]): Map[String, String] (the name and signature are placeholders for your API):
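import org.apache.spark.streaming.dstream.DStream

// dstream: DStream[String] of ids, as in the question
val infoStream: DStream[(String, String)] = dstream.transform { rdd =>
  rdd.mapPartitions { partitionOfIds =>
    val ids = partitionOfIds.toArray // only the ids in this partition
    val results = myUtility(ids)     // one bulk external call per partition
    results.iterator                 // the result is still an RDD, so the DStream is retained
  }
}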
This approach processes each partition of the underlying RDD in parallel, using the resources available in the cluster. Note that any connection to the external system needs to be instantiated once per partition (and not once per element).
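If the utility needs a client or connection to the external system, it can be created inside mapPartitions so that a single instance serves the whole partition. A sketch, where ExternalClient and its bulkLookup method are hypothetical stand-ins for your own client:

dstream.transform { rdd =>
  rdd.mapPartitions { partitionOfIds =>
    val client = new ExternalClient()                        // hypothetical client: one per partition, not per element
    val results = client.bulkLookup(partitionOfIds.toArray)  // assumed to return a materialized Map[String, String]
    client.close()                                           // safe because results is fully materialized
    results.iterator
  }
}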