I have a streaming job consuming from Kafka (using createDstream).
It is a stream of ids:
[id1, id2, id3 ...]
I have a utility or an API which accepts an array of ids, makes some external call, and receives back some info, say "t", for each id:
[id1:t1, id2:t2, id3:t3 ...]
I want to retain the DStream while calling the utility. I can't use a map transformation on the DStream's RDDs, as that would make one call per id; moreover, the utility accepts a collection of ids.
dstream.map(x => myUtility(x)) // ruled out
And if I use
dstream.foreachRDD(rdd => myUtility(rdd.collect()))
I lose the DStream. I need to retain the DStream for downstream processing.
The way to make bulk external calls while retaining the DStream is to transform the RDDs in the DStream directly at the partition level, using transform together with mapPartitions.
A sketch of the pattern, assuming a bulk utility with a signature like myUtility(ids: Array[String]): Map[String, String] (the name and signature are placeholders for your API):
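import org.apache.spark.streaming.dstream.DStream

// dstream: DStream[String] of ids, as in the question
val infoStream: DStream[(String, String)] = dstream.transform { rdd =>
  rdd.mapPartitions { partitionOfIds =>
    val ids = partitionOfIds.toArray // only the ids in this partition
    val results = myUtility(ids)     // one bulk external call per partition
    results.iterator                 // the result is still an RDD, so the DStream is retained
  }
}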
This approach processes each partition of the underlying RDD in parallel, using the resources available in the cluster. Note that any connection to the external system needs to be instantiated once per partition (and not once per element).
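If the utility needs a client or connection to the external system, it can be created inside mapPartitions so that a single instance serves the whole partition. A sketch, where ExternalClient and its bulkLookup method are hypothetical stand-ins for your own client:

dstream.transform { rdd =>
  rdd.mapPartitions { partitionOfIds =>
    val client = new ExternalClient()                        // hypothetical client: one per partition, not per element
    val results = client.bulkLookup(partitionOfIds.toArray)  // assumed to return a materialized Map[String, String]
    client.close()                                           // safe because results is fully materialized
    results.iterator
  }
}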