Invoking an external utility inside a Spark Streaming job


I have a streaming job consuming from Kafka (using createDstream). It is a stream of "id"s:

[id1,id2,id3 ..]

I have a utility (an API) which accepts an array of ids, makes an external call, and returns some information, say "t", for each id:

[id1:t1,id2:t2,id3:t3...]

I want to call the utility while retaining the DStream. I can't use a map transformation on the DStream's RDDs, as that would make one call per id, and moreover the utility accepts a collection of ids.

dstream.map(x => myutility(x)) -- ruled out

And if I use

dstream.foreachRDD(rdd => myutility(rdd.collect().toArray))

I lose the DStream, and I need to retain it for downstream processing.


1 Answer

maasg (BEST ANSWER)

The way to achieve bulk calls to an external service is to transform the RDDs in the DStream directly at the partition level.

The pattern looks like this:

val transformedStream = dstream.transform { rdd =>
  rdd.mapPartitions { iterator =>
    val externalService = Service.instance() // one instance per partition: reserve local resources or open server connections here
    val data = iterator.toList               // materialise the partition to act in bulk; tune partitioning to avoid huge data loads at this level
    val resultCollection = externalService(data)
    resultCollection.iterator
  }
}

This approach processes each partition of the underlying RDD in parallel, using the resources available in the cluster. Note that the connection to the external system is instantiated once per partition (not once per element).
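
For completeness, here is a minimal self-contained sketch of the same pattern. The socket source, the myBulkUtility function and the batch interval are placeholders (assumptions) standing in for the Kafka source and the real utility from the question; the point is that the enriched result is still a DStream, so downstream processing can continue on it:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BulkEnrichmentExample {

  // Placeholder for the real utility: one external call per batch of ids (assumption).
  def myBulkUtility(ids: Seq[String]): Map[String, String] =
    ids.map(id => id -> ("t-" + id)).toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("bulk-enrichment").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Stands in for the DStream[String] of ids coming from Kafka.
    val idStream = ssc.socketTextStream("localhost", 9999)

    // One bulk call per partition; the result is still a DStream of (id, t) pairs.
    val enriched = idStream.transform { rdd =>
      rdd.mapPartitions { ids =>
        val batch   = ids.toList            // materialise the partition for the bulk call
        val results = myBulkUtility(batch)  // single external call for the whole partition
        batch.iterator.map(id => (id, results(id)))
      }
    }

    enriched.print() // downstream processing continues on the DStream

    ssc.start()
    ssc.awaitTermination()
  }
}

With a direct Kafka stream, the number of partitions per batch (and hence the number of bulk calls) follows the topic's partitioning, which is the knob to tune if a single call would otherwise receive too many ids.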