I have two servers, each with a GPU. I'd like to run a reinforcement learning algorithm that uses both servers at the same time with Ray.

I imagine that one of the servers should act as the primary data store and also run the main driver process, which updates the neural network weights based on the results received from the servers.

Following this quick start guide and using this cluster file, I am getting the following output:

2019-06-02 04:29:47,169 INFO node_provider.py:34 -- ClusterState: Loaded cluster state: {}
2019-06-02 04:29:47,170 INFO node_provider.py:59 -- ClusterState: Writing cluster state: {'YOUR_HEAD_NODE_HOSTNAME': {'tags': {'ray-node-type': 'head'}, 'state': 'terminated'}}
This will create a new cluster [y/N]: y
2019-06-02 04:29:49,023 INFO commands.py:189 -- get_or_create_head_node: Launching new head node...
2019-06-02 04:29:49,024 INFO node_provider.py:77 -- ClusterState: Writing cluster state: {'YOUR_HEAD_NODE_HOSTNAME': {'tags': {'ray-node-type': 'head', 'ray-launch-config': '5a0ccc99d6349f2fb9699284ae2a3547c548975f', 'ray-node-name': 'ray-default-head'}, 'state': 'running'}}
2019-06-02 04:29:49,024 INFO commands.py:202 -- get_or_create_head_node: Updating files on head node...
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/ray/scripts/scripts.py", line 771, in main
    return cli()
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/scripts/scripts.py", line 462, in create_or_update
    no_restart, restart_only, yes, cluster_name)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/commands.py", line 47, in create_or_update_cluster
    override_cluster_name)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/commands.py", line 241, in get_or_create_head_node
    initialization_commands=config["initialization_commands"],
KeyError: 'initialization_commands'
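
Reading the traceback, the lookup that fails is `config["initialization_commands"]`, so the parsed cluster file apparently has no such key. Below is a minimal sketch of how I would check for it. The `config` dict is a hypothetical stand-in for my parsed cluster file, `missing_keys` is a helper I made up for illustration, and defaulting the key to an empty list is only my guess at a workaround, not something I have confirmed for Ray.

```python
# Hypothetical stand-in for the parsed cluster file; the real file has more keys.
config = {
    "cluster_name": "default",
    "provider": {"type": "local"},
}


def missing_keys(cfg, required):
    """Return the required keys that are absent from cfg, in the given order."""
    return [key for key in required if key not in cfg]


# The only key I know is required is the one named in the traceback.
required = ["initialization_commands"]

# Report which required keys the original config lacks.
print(missing_keys(config, required))

# Possible workaround (assumption): default the missing key to an empty list,
# working on a copy so the original config is left untouched.
patched = dict(config)
patched.setdefault("initialization_commands", [])
print(missing_keys(patched, required))
```

If that guess is right, the equivalent change in the cluster file itself would be a top-level `initialization_commands` entry with an empty list as its value.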

Any idea what's going wrong here? Ideally I'd like to have a super simple example of getting this set up.
