I am working on a variant of the VRPTW (vehicle routing problem with time windows). I am solving it with an actor-critic method that uses a graph neural network and an attention mechanism. I am having trouble deciding at which stage my approach should apply the VRPTW constraints. Should I build the constraints into the step function somehow, or implement them separately and then call them from both the step function and the actor-critic?
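The sketch below shows the second option under stated assumptions. The constraints live in one separate helper that returns a feasibility mask. The step function can call that helper to validate an action, and the actor can use the same mask to zero out infeasible actions before sampling. All names here are hypothetical and are not taken from my code or from any library. The sketch also assumes hard time windows, capacity limits, node 0 as the depot, and that "depart_time" is the moment the vehicle leaves the current node after service.

```python
import math
from typing import List, Sequence, Tuple


def feasible_mask(
    current: int,
    depart_time: float,
    load: float,
    capacity: float,
    demands: Sequence[float],
    windows: Sequence[Tuple[float, float]],
    service: Sequence[float],
    travel: Sequence[Sequence[float]],
    visited: Sequence[bool],
) -> List[bool]:
    """Hypothetical helper: mask[j] is True if node j may be visited next.

    Assumes node 0 is the depot and time windows are hard constraints.
    """
    n = len(demands)
    mask = [False] * n
    for j in range(1, n):
        if visited[j]:
            continue  # each customer is served exactly once
        if load + demands[j] > capacity:
            continue  # capacity constraint
        start = max(depart_time + travel[current][j], windows[j][0])  # wait if early
        if start > windows[j][1]:
            continue  # would arrive after the customer's window closes
        if start + service[j] + travel[j][0] > windows[0][1]:
            continue  # could not get back to the depot before it closes
        mask[j] = True
    # Returning to the depot is allowed from a customer, or as a fallback
    # when no customer is reachable, so at least one action is always valid.
    mask[0] = current != 0 or not any(mask)
    return mask


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over feasible actions only; infeasible actions get probability 0."""
    best = max(x for x, m in zip(logits, mask) if m)  # subtract max for stability
    exps = [math.exp(x - best) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

In a TensorFlow actor, the same idea would replace the logits of infeasible nodes with a large negative value (for example with tf.where) before tf.nn.softmax. The step function would call feasible_mask to reject or assert on actions that should never be chosen. Keeping the constraint logic in one place means the environment and the policy cannot disagree about what is feasible.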
This is my first time working with reinforcement learning. I am using TensorFlow and a custom environment built on gym.
So far, I have written the cost calculation outside the step function, and the reward is the negative of the total cost (reward = -total_costs).
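The sketch below shows the "cost outside step, reward = -cost" idea under stated assumptions. The cost is a pure function that step calls once per move, so each reward is the negative cost of the edge just travelled. ToyVRPEnv and route_cost are hypothetical names. The class deliberately does not import gym, but it mimics the older gym convention of step returning (obs, reward, done, info). Newer gym and gymnasium versions return five values instead. The sketch tracks travel cost only and ignores time windows and capacity.

```python
from typing import Sequence


def route_cost(route: Sequence[int], travel: Sequence[Sequence[float]]) -> float:
    """Hypothetical helper: total travel cost of a node sequence like [0, 2, 1, 0]."""
    return sum(travel[a][b] for a, b in zip(route[:-1], route[1:]))


class ToyVRPEnv:
    """Hypothetical minimal environment (not a gym subclass), travel cost only."""

    def __init__(self, travel: Sequence[Sequence[float]]):
        self.travel = travel
        self.n = len(travel)
        self.reset()

    def reset(self):
        self.route = [0]  # every episode starts at the depot (node 0)
        self.visited = [False] * self.n
        self.visited[0] = True
        return list(self.route)

    def step(self, action: int):
        prev = self.route[-1]
        self.route.append(action)
        if action != 0:
            self.visited[action] = True
        # Dense reward: negative cost of the edge just travelled, computed by
        # the same helper that scores a full route.
        reward = -route_cost([prev, action], self.travel)
        # The episode ends once every customer is served and the vehicle is
        # back at the depot; a return to the depot before that starts a new route.
        done = action == 0 and all(self.visited)
        info = {"total_cost": route_cost(self.route, self.travel)}
        return list(self.route), reward, done, info
```

With an undiscounted return (gamma = 1), the per-step rewards sum exactly to -route_cost of the finished route. A dense reward and a single terminal reward of -total_costs therefore give the same return. With gamma < 1 the two schemes are no longer equivalent. Because route_cost is a plain function, it can also be checked by hand on a small instance before any training is run.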
I have not been able to test it and look at the results yet. Still, something about this approach looks wrong to me, which is why I am asking for help.