I am currently working on a small Spark Streaming job that computes a stock correlation matrix from a DStream.
From a DStream[(Long, Double)] of (time, quote) pairs, I need to aggregate the quotes (Double) by time (Long) across multiple RDDs before computing the correlations over all of the quotes in those RDDs:
    dstream.reduceByKeyAndWindow(/* aggregate the quotes of each time key into a Vector */)
           .foreachRDD { rdd => Statistics.corr(rdd.values) }
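In full, this is roughly what I have in mind (a minimal, untested sketch; the socket source, the 60s/10s window sizes, and the Seq-based aggregation are placeholders I chose for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("StockCorr").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Placeholder source: assume "time,quote" lines arrive on a socket.
    val quotes: DStream[(Long, Double)] = ssc.socketTextStream("localhost", 9999)
      .map(_.split(",")).map(a => (a(0).toLong, a(1).toDouble))

    // Gather every quote seen for the same timestamp within a 60s window,
    // recomputed every 10s (window sizes are arbitrary placeholders).
    val byTime = quotes
      .mapValues(Seq(_))
      .reduceByKeyAndWindow((a: Seq[Double], b: Seq[Double]) => a ++ b,
        Seconds(60), Seconds(10))

    byTime.foreachRDD { rdd =>
      // One row Vector per timestamp; note that Statistics.corr requires
      // every row to have the same length (same number of quotes per time key).
      val rows = rdd.map { case (_, qs) => Vectors.dense(qs.toArray) }
      if (!rows.isEmpty()) {
        println(Statistics.corr(rows)) // Pearson correlation matrix by default
      }
    }

    ssc.start()
    ssc.awaitTermination()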
To my mind, this would work if the resulting DStream (produced by reduceByKeyAndWindow) contained only one RDD holding all of the aggregated quotes.
But I am not sure that it does. How is the data distributed after reduceByKeyAndWindow? Is there a way to merge the RDDs of a DStream into a single RDD?
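To make the second question concrete, this is the kind of check I have in mind (a hypothetical diagnostic on the byTime stream from the sketch above):

    // Diagnostic: how many partitions does each windowed batch RDD have,
    // and how many distinct time keys does it hold?
    byTime.foreachRDD { rdd =>
      println(s"partitions = ${rdd.partitions.length}, keys = ${rdd.count()}")
    }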