Correct approach to improve/retrain an offline model


I have a recommendation system that was trained with Behavior Cloning (BC) on offline data. The data was generated by a supervised learning model and converted to batch format using the approach described here. The model is currently exploring with an epsilon-greedy strategy. I want to migrate from BC to MARWIL by changing beta.
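For context, my understanding is that MARWIL weights each sample's imitation term by exp(beta * advantage / c), so beta = 0 gives every sample the same weight and reduces to BC. A minimal sketch of that weighting, assuming a fixed normalizer (the function name `marwil_weights` and the `scale` argument are only illustrative, not a library API):

```python
import math


def marwil_weights(advantages, beta, scale=1.0):
    """Per-sample imitation weights exp(beta * A / scale).

    `scale` stands in for the advantage normalizer; it is fixed here
    as an assumption to keep the sketch small.
    """
    return [math.exp(beta * a / scale) for a in advantages]


# Toy advantage estimates for three logged actions (made-up values).
advantages = [-1.0, 0.0, 2.0]

# beta = 0: every sample is weighted equally, which is plain BC.
bc_weights = marwil_weights(advantages, beta=0.0)

# beta > 0: actions with higher estimated advantage get more weight.
marwil = marwil_weights(advantages, beta=1.0)
```

So the migration mostly comes down to which data the weighted loss sees, which is what the options below are about.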

There are a few ways to do that:

  1. Convert the data used to train the BC algorithm together with the agent's new data, and retrain from scratch using MARWIL.
  2. Convert only the new data generated by the agent, combine it with the previously converted BC training data through the input parameter (similar to what is described here), and retrain from scratch using MARWIL.
  3. Convert only the new data generated by the agent, combine it with the previously converted BC training data through the input parameter (similar to what is described here), and continue training the restored BC agent using MARWIL.

Questions:

Following option 1:

Given that the new data slice would be very small compared with the original one, would the model learn anything new? When should we stop using the original data?

Following option 2:

Given that the new data slice would be very small compared with the original one, would the model learn anything new? When should we stop using the original data? This approach works for trajectories associated with new episode IDs, but will it extend the trajectories of episodes already present in the original batch?
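To make the data-size concern concrete: as far as I understand, the input parameter can also take a dict that maps each source to a sampling probability, which would let the small new slice be oversampled. A minimal sketch that only simulates such a mix in plain Python (both paths and the 0.7/0.3 split are hypothetical, and `sample_source` is an illustrative helper, not a library function):

```python
import random

# Hypothetical sources mapped to sampling probabilities.
input_spec = {
    "/data/bc_original/*.json": 0.7,  # large original BC dataset
    "/data/agent_new/*.json": 0.3,    # small slice of new agent data
}


def sample_source(spec, rng):
    """Pick the source of the next batch according to the probabilities."""
    sources = list(spec)
    probs = [spec[s] for s in sources]
    return rng.choices(sources, weights=probs, k=1)[0]


rng = random.Random(0)
draws = [sample_source(input_spec, rng) for _ in range(10_000)]

# Share of batches coming from the new data, independent of its raw size.
new_share = draws.count("/data/agent_new/*.json") / len(draws)
```

With a mix like this, the new data would make up roughly 30% of the sampled batches even if it is a tiny fraction of the stored data. Whether that is enough for the model to learn something new is part of what I am asking.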

Following option 3:

Given that the new data slice would be very small compared with the original one, would the model learn anything new? When should we stop using the original data? This approach works for trajectories associated with new episode IDs, but will it extend the trajectories of episodes already present in the original batch? Retraining would update the network weights using the new data points, but how many iterations should we run? How can we prevent catastrophic forgetting?
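One idea I have for the iteration count is to keep a held-out sample of the original data and stop fine-tuning once the loss on it degrades past a tolerance. A minimal sketch, assuming a positive loss; `train_step` and `eval_old_loss` are hypothetical callables standing in for one training iteration of the restored agent and an evaluation on the held-out original data:

```python
def fine_tune(train_step, eval_old_loss, max_iters=100, tolerance=0.05):
    """Run training iterations until the loss on held-out original data
    rises more than `tolerance` (relative) above its starting value.

    Returns the number of iterations actually run.
    """
    baseline = eval_old_loss()
    for i in range(1, max_iters + 1):
        train_step()
        if eval_old_loss() > baseline * (1.0 + tolerance):
            return i  # the old-data loss degraded too much: stop here
    return max_iters


# Toy stand-ins: each step worsens the old-data loss by 0.02 (made-up numbers).
state = {"steps": 0}


def toy_train_step():
    state["steps"] += 1


def toy_eval_old_loss():
    return 1.0 + 0.02 * state["steps"]


stopped_at = fine_tune(toy_train_step, toy_eval_old_loss)
```

I am not sure this is the right criterion, so suggestions for a better one are welcome.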

