Scala - Spark DStream operation similar to cbind in R


1) I am trying to use an MLlib Random Forest. My final output should have two columns:

id, predicted_value 
1,  0.5 
2,  0.4 

My feature sets are the training and scoring data (train and score). When I train and score, I drop the id field, since it is unique for each row and carries no predictive signal. I then get the predicted scores.

My scored output looks like:

predicted_value 
0.5 
0.4 

But I want to tie each prediction back to its id.

I have the id field in one DStream and predicted_value in another DStream. How do I bind them to each other? I don't have any common column to join on.

So how do I tie them back together? For example, R has the function cbind, which can bind columns from different data frames:

x<-data.frame(cbind(testIds,p$p1)) 

Is this possible in Spark, or is there an alternative?
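For reference, Spark's closest analogue to R's cbind is RDD.zip, which pairs the n-th element of one RDD with the n-th element of another (both RDDs must have the same number of partitions and the same number of elements per partition). The positional semantics are the same as zip on plain Scala collections, sketched here with hypothetical values:

```scala
// Positional pairing, like R's cbind(testIds, p$p1).
// testIds and p1 are made-up stand-ins for the two columns.
val testIds = Seq(1, 2, 3)
val p1      = Seq(0.5, 0.4, 0.9)

// zip pairs elements by position: first with first, second with second, ...
val bound = testIds.zip(p1)
// bound: Seq((1, 0.5), (2, 0.4), (3, 0.9))
```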

2) I am using an MLlib random forest model to predict with Spark Streaming. In the end, I want to combine the feature DStream and the prediction DStream for further downstream processing. How can I do that?

Thanks in advance.

1 Answer

Answered by user7735111

You can use DStream.transform and predict:

    dstream.transform { rdd =>
      // Score every feature vector in this micro-batch...
      val predictions = model.predict(rdd)
      // ...and pair each input row with its prediction positionally.
      rdd.zip(predictions)
    }
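Since zip pairs elements positionally, the same idea answers the id question: keep the ids alongside the features, score the features, and zip the ids back onto the predictions. A minimal sketch of the pattern with plain Scala collections, where `predict` is a hypothetical stand-in for `model.predict`:

```scala
// Hypothetical ids and feature rows for one micro-batch.
val ids      = Seq(1, 2)
val features = Seq(Vector(1.0, 2.0), Vector(3.0, 4.0))

// Stand-in for model.predict: any deterministic scoring function.
def predict(v: Vector[Double]): Double = v.sum

// Score each row, then zip the ids back on positionally,
// just as RDD.zip (or R's cbind) would.
val predictions = features.map(predict)
val result = ids.zip(predictions)
// result: Seq((1, 3.0), (2, 7.0))
```

With RDDs inside `transform`, the same caveat as above applies: `rdd.zip` requires both sides to have identical partitioning and element counts, which holds here because the predictions are derived one-to-one from the input rows.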