How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
I started using Ray RLlib (version 2.7.1) to solve an RL problem with customized environments and agents. The environments are set up properly, but now that I have started customizing models, I am quite confused and cannot proceed.
1. What do the parameters of each method mean? According to the official docs, one may customize a PyTorch model like this:
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

class MyModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space,
                              num_outputs, model_config, name)
        nn.Module.__init__(self)
        # __init__ function logic

    def forward(self, input_dict, state, seq_lens):
        # forward function logic
        ...

    def value_function(self):
        # value function logic
        ...
But the official docs do not explain what each parameter means. Where can I find detailed explanations?
2. How does _disable_preprocessor_api work?
What confuses me most is that the obs_space parameter of the __init__ method differs from what I have defined. Also, the input_dict parameter of the forward method contains the observation input_dict['obs'], whose data structure is unclear to me. What's more, there seems to be a discrepancy between the default behaviour described in the official docs and the behaviour I see in my case.
Let's be specific. My observation space is:
operation_space = Dict({
    'a': Discrete(13),
    'b': Tuple([
        Dict({
            'b1': Discrete(3),
            'b2': Box(low=-10, high=10, shape=(1,), dtype=np.float64),
            'b3': Discrete(7)
        }) for _ in range(2)
    ]),
    'c': Discrete(7)
})
self.observation_space = Tuple([operation_space]*10)
My config is as follows:
config = (
    get_trainable_cls('PPO')
    .get_default_config()
    .rl_module(_enable_rl_module_api=False)
    .training(
        model={
            "_disable_preprocessor_api": True,
            "custom_model": "my_model",
            # "custom_model_config": {
            #     "input_files": args.input_files,
            # },
        },
        _enable_learner_api=False
    )
    .environment(RLSearchEnv, env_config=RLSearchEnv_config)
    .framework("torch")
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=2)
    .experimental(_disable_preprocessor_api=True)
)
There are two _disable_preprocessor_api settings in the config. Although the official docs explain them identically, I see different behaviours when I set them to different values.
Case 1. model={"_disable_preprocessor_api": False, ...} with .experimental(_disable_preprocessor_api=False). This is RLlib's default behaviour, and the default preprocessors are applied.
The obs_space is Box(-1.0, 1.0, (420,), float32), which is the one-hot encoded and flattened version of my original definition. I've checked that the size matches.
The input_dict['obs'] preserves the original nested structure of my observation space (i.e. it is a list of length 10), but each Discrete subspace is now a one-hot encoded torch.Tensor with an additional batch dimension:
>>> input_dict['obs'][0]['a'].shape
torch.Size([32, 13])
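As a sanity check on the 420 figure in Case 1, here is a minimal sketch of the size arithmetic. It assumes each Discrete(n) contributes n one-hot slots and each Box contributes one slot per scalar. The helper names (discrete, box, dict_space, tuple_space, flat_size) are hypothetical plain-Python stand-ins, not gymnasium or RLlib APIs.

```python
from math import prod

# Hypothetical plain-Python stand-ins for the gymnasium spaces in the post,
# so the arithmetic can be run without gymnasium or Ray installed.
def discrete(n):
    return ("discrete", n)

def box(shape):
    return ("box", shape)

def dict_space(**children):
    return ("dict", children)

def tuple_space(children):
    return ("tuple", list(children))

def flat_size(space):
    """Length of the vector after one-hot encoding and flattening (assumed rule)."""
    kind, payload = space
    if kind == "discrete":
        return payload                 # one slot per category
    if kind == "box":
        return prod(payload)           # one slot per scalar entry
    if kind == "dict":
        return sum(flat_size(s) for s in payload.values())
    if kind == "tuple":
        return sum(flat_size(s) for s in payload)
    raise ValueError(f"unknown space kind: {kind}")

b_item = dict_space(b1=discrete(3), b2=box((1,)), b3=discrete(7))
operation_space = dict_space(
    a=discrete(13),
    b=tuple_space([b_item] * 2),
    c=discrete(7),
)
observation_space = tuple_space([operation_space] * 10)

per_operation = flat_size(operation_space)  # 13 + 2 * (3 + 1 + 7) + 7
total = flat_size(observation_space)        # 10 copies of operation_space
print(per_operation, total)
```

Each operation_space contributes 42 slots, and ten of them give the 420 reported for obs_space.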
Case 2. model={"_disable_preprocessor_api": True, ...} with .experimental(_disable_preprocessor_api=False)
The obs_space is Box(-1.0, 1.0, (420,), float32), the same as in Case 1.
The input_dict['obs'] is now a torch.Tensor with shape [32, 420]. My guess is that the observation is flattened and a batch dimension is prepended.
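The guess in Case 2 can be sketched in plain Python: flatten one nested sample into a single vector, then stack 32 of them. The traversal order used here (Dict keys in sorted order, Tuple items in sequence) is an assumption and not confirmed RLlib behaviour, and all function names and sample values are hypothetical.

```python
def one_hot(index, n):
    """One-hot encode a Discrete(n) value as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def flatten_b_item(item):
    # Assumed order: b1 (3 slots), b2 (1 slot), b3 (7 slots).
    return one_hot(item["b1"], 3) + list(item["b2"]) + one_hot(item["b3"], 7)

def flatten_operation(op):
    # Assumed order: a (13 slots), the two b items (11 slots each), c (7 slots).
    flat = one_hot(op["a"], 13)
    for item in op["b"]:
        flat += flatten_b_item(item)
    flat += one_hot(op["c"], 7)
    return flat

def flatten_observation(obs):
    flat = []
    for op in obs:
        flat += flatten_operation(op)
    return flat

# A made-up sample matching the structure of operation_space.
sample_op = {
    "a": 4,
    "b": (
        {"b1": 2, "b2": [0.5], "b3": 6},
        {"b1": 0, "b2": [-3.0], "b3": 1},
    ),
    "c": 5,
}
sample_obs = tuple(sample_op for _ in range(10))

# Stack 32 flattened samples to mimic the batch dimension.
batch = [flatten_observation(sample_obs) for _ in range(32)]
batch_shape = (len(batch), len(batch[0]))
print(batch_shape)  # (32, 420)
```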
Case 3. model={"_disable_preprocessor_api": False, ...} with .experimental(_disable_preprocessor_api=True)
The obs_space now preserves the original nested structure; it is the same as self.observation_space.
The input_dict['obs'] preserves the original nested structure of my observation space. Unlike Case 1, the Discrete subspaces are not one-hot encoded, but a batch dimension is still added:
>>> input_dict['obs'][0]['a'].shape
torch.Size([32])
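The difference between the [32, 13] shape in Case 1 and the [32] shape in Case 3 for the 'a' subspace comes down to one-hot vectors versus raw category indices. A minimal plain-Python sketch, using nested lists in place of tensors (the helpers one_hot and shape are hypothetical, and the index values are made up):

```python
def one_hot(index, n):
    """One-hot encode a Discrete(n) value as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def shape(x):
    """Shape of a rectangular nested list, as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

batch_size = 32

# Case 3 style: one raw index in [0, 13) per batch element.
raw_a = [i % 13 for i in range(batch_size)]

# Case 1 style: each index expanded to a 13-slot one-hot vector.
encoded_a = [one_hot(i, 13) for i in raw_a]

print(shape(raw_a))      # (32,)
print(shape(encoded_a))  # (32, 13)

# The one-hot form carries the same information: the index can be recovered.
recovered = [row.index(1.0) for row in encoded_a]
```

Both forms carry the same information, so a custom model written against the Case 3 layout would have to do its own encoding of the indices.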
Case 4. model={"_disable_preprocessor_api": True, ...} with .experimental(_disable_preprocessor_api=True)
The obs_space is the same as self.observation_space.
The input_dict['obs'] is the same as in Case 3.
These combinations are too complicated for me to understand, and I cannot continue programming until I figure out exactly what they mean. I would greatly appreciate an explanation of how to use these parameters. Thanks a lot in advance.
Please refer to the cases in the problem description.