We are using spark-ml to build a model from existing data. New data arrives daily.
Is there a way to read only the new data and update the existing model, without having to read all the data and retrain every time?
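It depends on the model you're using, but for some models Spark does exactly what you want. Look at StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD and, more broadly, StreamingLinearAlgorithm. Note that these classes live in the RDD-based spark.mllib package and are trained on DStreams; they are not part of the DataFrame-based spark.ml package, so using them means moving that model to the mllib API.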
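To illustrate the idea behind the streaming models, here is a minimal pure-Python sketch of a mini-batch k-means update. It is not Spark code and not Spark's implementation: the function names `nearest` and `update` are made up for this example, and the update rule is a simplified weighted average in the spirit of what StreamingKMeans does. The `decay` argument is an assumption that plays a role similar to StreamingKMeans's decay factor. The point is only that the model keeps a small state (centers and per-cluster weights) and folds each new batch into it without rereading old data.

```python
# Illustrative sketch only: a hand-rolled mini-batch k-means update,
# not Spark's StreamingKMeans. All names here are hypothetical.

def nearest(centers, point):
    """Return the index of the center closest to point (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centers[i], point)),
    )


def update(centers, weights, batch, decay=1.0):
    """Fold one batch of points into the model and return the new state.

    centers: list of coordinate lists, one per cluster
    weights: effective number of points seen so far, one per cluster
    decay:   1.0 keeps all history; smaller values down-weight old data
             (assumed knob, loosely analogous to a decay factor)
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)

    # Assign each new point to its nearest current center.
    for point in batch:
        i = nearest(centers, point)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]

    # New center = weighted average of the old center and the new points.
    new_centers, new_weights = [], []
    for c, w, s, m in zip(centers, weights, sums, counts):
        old = w * decay
        total = old + m
        if total == 0:
            new_centers.append(list(c))  # cluster has never seen data
        else:
            new_centers.append([(c[d] * old + s[d]) / total for d in range(dim)])
        new_weights.append(total)
    return new_centers, new_weights


# Start from initial guesses with zero weight, then feed one batch per day.
centers, weights = [[0.0, 0.0], [10.0, 10.0]], [0.0, 0.0]
centers, weights = update(centers, weights, [[1.0, 1.0], [9.0, 9.0]])    # day 1
centers, weights = update(centers, weights, [[3.0, 3.0], [11.0, 11.0]])  # day 2
print(centers, weights)
```

In Spark itself you would not write this by hand: the streaming classes expose `trainOn` to consume a DStream of new data and update the model as each batch arrives.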