So I created a Stable-Baselines3 model using A2C to train on the simple_spread environment from PettingZoo (https://pettingzoo.farama.org/environments/mpe/simple_spread/). I followed the SB3 tutorial provided by PettingZoo for this, but for some reason I never get a reward value higher than 0, and even after training the average reward does not go above -300 over around 10 episodes. I want to ask why this is happening, because not getting even one positive reward seems very strange; even a random policy in other environments can do better than this. Anyway, here is my implementation of the model:
```python
import time

import supersuit as ss
from supersuit.vector import MarkovVectorEnv
from stable_baselines3 import A2C
from stable_baselines3.a2c import MlpPolicy


def train_model(
    env_fn, steps: int = 10_000, seed: int = 0, **env_kwargs
):
    env = env_fn.parallel_env(**env_kwargs)
    env.reset(seed=seed)

    print(f"Starting training on {str(env.metadata['name'])}.")

    # Vectorize the parallel env for SB3 (MarkovVectorEnv is what
    # ss.pettingzoo_env_to_vec_env_v1 applies under the hood)
    # env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = MarkovVectorEnv(env)
    env = ss.concat_vec_envs_v1(env, 2, num_cpus=2, base_class="stable_baselines3")

    policy_kwargs = {"net_arch": [128, 128]}
    model = A2C(
        MlpPolicy,
        env,
        verbose=1,
        learning_rate=0.002,
        gamma=0.99,
        ent_coef=0.03,
        policy_kwargs=policy_kwargs,
    )

    model.learn(total_timesteps=steps)

    model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")
    print("Model has been saved.")
    print(f"Finished training on {str(env.unwrapped.metadata['name'])}.")

    env.close()
```
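For reference, here is roughly how I measure that average reward (a simplified sketch of my evaluation, not the exact script; function and variable names are just for illustration). It loads the latest saved model and averages the summed agent rewards over 10 episodes using the parallel API:

```python
import glob
import os

import numpy as np
from pettingzoo.mpe import simple_spread_v3
from stable_baselines3 import A2C


def eval_model(env_fn, num_episodes: int = 10, **env_kwargs):
    env = env_fn.parallel_env(**env_kwargs)

    # Load the most recently saved checkpoint for this environment
    latest_policy = max(
        glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
    )
    model = A2C.load(latest_policy)

    episode_returns = []
    for ep in range(num_episodes):
        obs, infos = env.reset(seed=ep)
        total = 0.0
        while env.agents:
            # One action per live agent, all from the single shared policy
            actions = {
                a: model.predict(obs[a], deterministic=True)[0] for a in env.agents
            }
            obs, rewards, terminations, truncations, infos = env.step(actions)
            total += sum(rewards.values())
        episode_returns.append(total)

    env.close()
    print(f"Average return over {num_episodes} episodes: {np.mean(episode_returns):.1f}")


eval_model(simple_spread_v3, num_episodes=10)
```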
Any help or guidance would be appreciated.
I tried customizing the policy network and even the feature extractor (https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html), but the reward still didn't improve. I have also tried just about every hyperparameter combination: different step counts, learning rates, and other model settings, and even switched to PPO, and it still doesn't make sense. I then looked at the simple_spread reward function, which is as follows (it seems to only ever subtract from the reward and never add anything, which doesn't feel right to me, but I'm not sure since I'm fairly new to this):
```python
def reward(self, agent, world):
    # Agents are rewarded based on minimum agent distance to each landmark, penalized for collisions
    rew = 0
    for l in world.landmarks:
        dists = [
            np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos)))
            for a in world.agents
        ]
        rew -= min(dists)
    if agent.collide:
        for a in world.agents:
            if self.is_collision(a, agent):
                rew -= 1
    return rew
```
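To sanity-check my reading of it, I reproduced the same arithmetic in a small standalone snippet with made-up positions (this is not the actual MPE code path, just the distance term from the function above):

```python
import numpy as np

# Toy positions, just for illustration: 3 landmarks and 3 agents sitting
# close to them (roughly the best case a policy could reach).
landmarks = [np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([0.0, -0.5])]
agent_positions = [np.array([0.4, 0.6]), np.array([-0.6, 0.4]), np.array([0.1, -0.4])]

rew = 0.0
for l in landmarks:
    dists = [np.sqrt(np.sum(np.square(a - l))) for a in agent_positions]
    rew -= min(dists)  # distances are always >= 0, so this only ever subtracts

print(rew)  # about -0.42 here; it can approach 0 but never go above it
```

So unless I'm misreading it, the best possible per-step reward is 0 (and collisions only push it further down), which would at least explain why I never see a positive value, but I still don't understand why training doesn't bring the average up.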