I am currently working with Apache Flink's SVM-Class to predict some text data.
The class provides a predict-function which is taking a DataSet[Vector] as an input and gives me a DataSet[Prediction] as result. So far so good.
My problem is, that i dont have the context which prediction belongs to which text and i cant insert the text within the predict()-function to have it afterwards.
Code:
val tweets: DataSet[(SparseVector, String)] =
source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
.map(tweet => (featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction to have everything together ? because without context the prediction is not helping me.
Or maybe there is a way to just predict one vector instead of a DataSet, that i could call the function inside the map function above.
The
SVM
predictor expects as input a sub type ofVector
. Hence there are two options to solve this problem:Vector
which contains the tweet text as a tag. It will then be looped through the predictor. This approach has the advantage that no additional operation is needed. However, one needs define new classes an utilities to represent different vector types with tags:DataSet
with the inputDataSet
on the vectorized representation of thetweets
. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive: