I am trying to build an RL model with deep Q-learning using RL4J in the AnyLogic PLE as part of my thesis. Unfortunately, I am not overly familiar with AnyLogic or DL4J and might therefore be missing some obvious steps. I only have access to the PLE and am wondering what the best approach is to train an RL model in AnyLogic. All the examples I found online (traffic light, vehicle battery) either use custom experiments or export the project as a stand-alone application to train their RL model. Neither of these functionalities is accessible in the PLE, so I tried to come up with a different way.
A crucial part of the mentioned examples is the creation and destruction of the engine in the RL model's reset() function. I am not aware of a way to do the same in the PLE without stopping the simulation altogether. My basic idea for a workaround was to create a function in my Main agent that resets the environment to its initial state as best as possible, sketched below.
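To make the idea more concrete, the reset function on my Main agent currently looks roughly like this (simplified; the variable names are placeholders and my actual model resets a few more things):

// Function "resetExperiment" on the Main agent (simplified sketch)
void resetExperiment() {
    stepCount = 0;    // step counter used for the reward calculation
    is_done = false;  // episode-finished flag read by the RL4J MDP
    // ... reset further model variables here ...
    // Finally, drive the statechart back to its initial state, e.g. via a
    // dedicated message-triggered "reset" transition, since I don't know of a
    // way to restart a statechart (or the engine) from within the PLE.
}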
A bit more about my setup: I created a separate RL agent in AnyLogic which has one function containing all of the RL4J code. To train the model, this function gets called from the Main agent, which contains my environment and all the functions needed to interact with it (get observations, take actions, calculate rewards, and check whether the episode is done); these are sketched below. On top of that, the Main agent contains the aforementioned reset function, which resets the statechart (my environment) to its initial state, the step counter (for reward calculation), etc. Unfortunately, I haven't been able to get this running yet, because the state of the statechart doesn't seem to update after the RL agent takes an action. Hence, I can't tell whether my approach is feasible at all or whether it could never work.
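For context, the interaction functions on Main that the RL4J code calls are roughly shaped like this (heavily simplified; the real bodies depend on my model):

// Functions on the Main agent used by the MDP below (simplified sketches)
double[] getObservation() {
    double[] obs = new double[18];
    // ... encode the current statechart state, counters, etc. into obs ...
    return obs;
}

void takeAction(int action) {
    stepCount++;
    // map the discrete action (0..3) onto the model, e.g. by triggering the
    // corresponding statechart transition -- this is where the statechart
    // state does not seem to update afterwards
}

double calcReward() {
    // reward derived from the state reached and the current step count
    return 0; // placeholder
}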
I wanted to ask whether my approach will work once I figure out what is causing the current issue, or whether there is a better way of training an RL agent inside the AnyLogic PLE.
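In case it helps, the RL4J classes in the snippet below come from roughly these packages (paths as in the 1.0.0-beta7 release; other versions may differ slightly):

import java.io.IOException;
import java.util.Arrays;
import org.deeplearning4j.gym.StepReply;
import org.deeplearning4j.rl4j.learning.sync.qlearning.QLearning;
import org.deeplearning4j.rl4j.learning.sync.qlearning.discrete.QLearningDiscreteDense;
import org.deeplearning4j.rl4j.mdp.MDP;
import org.deeplearning4j.rl4j.network.dqn.DQNFactoryStdDense;
import org.deeplearning4j.rl4j.policy.DQNPolicy;
import org.deeplearning4j.rl4j.space.ArrayObservationSpace;
import org.deeplearning4j.rl4j.space.DiscreteSpace;
import org.deeplearning4j.rl4j.space.Encodable;
import org.deeplearning4j.rl4j.space.ObservationSpace;
import org.deeplearning4j.rl4j.util.DataManager;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.learning.config.RmsProp;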
I don't know whether it is relevant, but the code inside my RL agent's training function is:
// Anonymous MDP that wraps the AnyLogic model: observations, actions and
// rewards are delegated to functions on the Main agent ("main").
MDP<Encodable, Integer, DiscreteSpace> mdp = new MDP<Encodable, Integer, DiscreteSpace>() {

    // 18 observation values, 4 discrete actions
    ObservationSpace<Encodable> observationSpace = new ArrayObservationSpace<>(new int[] {18});
    DiscreteSpace actionSpace = new DiscreteSpace(4);

    public ObservationSpace<Encodable> getObservationSpace() {
        return observationSpace;
    }

    public DiscreteSpace getActionSpace() {
        return actionSpace;
    }

    public Encodable getObservation() {
        System.out.println(Arrays.toString(main.getObservation()));
        return new Encodable() {
            double[] a = main.getObservation();

            public double[] toArray() {
                return a;
            }

            public boolean isSkipped() {
                return false;
            }

            public INDArray getData() {
                // Not sure whether returning null is safe here; depending on the
                // RL4J version the observation may be built from getData()
                // rather than toArray().
                return null;
            }

            public Encodable dup() {
                return null;
            }
        };
    }

    public Encodable reset() {
        System.out.println("Reset");
        main.resetExperiment(); // my workaround instead of destroying/recreating the engine
        return getObservation();
    }

    public void close() {
        System.out.println("Close");
    }

    public StepReply<Encodable> step(Integer action) {
        System.out.println("Took action: " + action);
        main.takeAction(action);
        double reward = main.calcReward();
        System.out.println("Reward: " + reward);
        return new StepReply<Encodable>(getObservation(), reward, isDone(), null);
    }

    public boolean isDone() {
        return main.is_done;
    }

    public MDP<Encodable, Integer, DiscreteSpace> newInstance() {
        return null; // never used here, training runs on this single instance
    }
};
try {
    DataManager manager = new DataManager(true);

    // Q-learning hyperparameters
    QLearning.QLConfiguration AL_QL =
        new QLearning.QLConfiguration(
            1,       // random seed
            10000,   // max steps per epoch
            100000,  // max total steps
            100000,  // max size of experience replay
            128,     // batch size
            1000,    // target network update frequency (hard update)
            10,      // number of no-op warmup steps
            1,       // reward scaling
            0.99,    // gamma (discount factor)
            1.0,     // TD-error clipping
            0.1f,    // minimum epsilon
            30000,   // number of steps over which epsilon is annealed
            true     // double DQN
        );

    // Dense network with two hidden layers of 300 nodes each
    DQNFactoryStdDense.Configuration AL_NET =
        DQNFactoryStdDense.Configuration.builder()
            .l2(0).updater(new RmsProp(0.001)).numHiddenNodes(300).numLayer(2).build();

    QLearningDiscreteDense<Encodable> dql = new QLearningDiscreteDense<Encodable>(mdp, AL_NET, AL_QL, manager);
    dql.train();

    DQNPolicy<Encodable> pol = dql.getPolicy();
    pol.save("Statechart.zip");
    mdp.close();
} catch (IOException e) {
    e.printStackTrace();
}
If you need any further information, please let me know.
Looking forward to any suggestions, and thank you!