Executing ray on distributed computing


I have data on a server that I set up in VirtualBox, and a gRPC server running there. On my local machine, I have a gRPC client that sends a class instance to the server; the server executes a method on it and returns the result.

I am trying to implement the same thing using Ray, but I can't work out which tools I should use. Should I create a Ray cluster on my local machine? If so, how do I write a program on the VirtualBox VM that connects to the cluster and executes the required method on the data that lives in the VM?


1 Answer

Answered by Alex

Edit Feb 4 2021: There is now a "correct" way to do this. Use the Ray Client!
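For reference, the Ray Client lets a script on your local machine drive a remote cluster directly over a "ray://" address. A minimal sketch, assuming Ray is installed on both machines and using the client's default port 10001 (the VM IP is a placeholder you would fill in):

```shell
# On the VM: start the head node with the Ray Client server enabled
# (10001 is the default client server port).
ray start --head --ray-client-server-port=10001

# On the local machine, a Python script would then connect with:
#   import ray
#   ray.init("ray://<vm-ip>:10001")
```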

Here's one possible configuration of a two-node Ray setup for your use case:

  • Treat the VM as the head node of your cluster. You can initialize the cluster via ray start --head --resources='{"data": 1}' (the "data": 1 part will become relevant in a second).

  • Now you can connect to it from your local machine via ray.init(address="...") in a Python script (you may need to make sure your networking/port forwarding for your VM is set up correctly).

  • Ray will try to run tasks/actors on any node with sufficient resources by default. Presumably, you want most of your tasks to run on the VM, so there are two reasonable ways of implementing this.
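If the VM uses VirtualBox's NAT networking, the port forwarding mentioned above might look like the following. The VM name "ray-head" is an assumption, and 6379 is Ray's default head node port:

```shell
# Forward the Ray head node port (6379 by default) from host to guest.
VBoxManage modifyvm "ray-head" --natpf1 "ray-head,tcp,,6379,,6379"
# Forward the Ray Client port too, if you use the client.
VBoxManage modifyvm "ray-head" --natpf1 "ray-client,tcp,,10001,,10001"
```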

Approach 1:

You can use num_cpus=0 to tell Ray that your local machine has no CPU resources, so it should run everything on the head node. This approach is preferable if you want to treat your local machine purely as a client. You don't need the --resources flag if you do this.

import ray

# Connect to the existing cluster (replace "..." with the head node's
# address); num_cpus=0 marks this machine as having no CPUs, so tasks
# are scheduled on the head node instead.
ray.init(address="...", num_cpus=0)

@ray.remote
def foo():
    print("This runs on the VM")

print("This runs locally")
ray.get(foo.remote())

Approach 2:

Use custom resources to constrain which nodes your task can run on.

import ray

# Connect to the existing cluster (replace "..." with the head node's address).
ray.init(address="...")

# Require a sliver of the custom "data" resource, which only the head node
# advertises, so this task can only be scheduled there.
@ray.remote(resources={"data": 0.01})
def foo():
    print("This runs on the VM")

print("This runs locally")
ray.get(foo.remote())

Since the remote function requires some of the "data" resource, it will run on the head node (and since it only requires 0.01 of that resource, up to 100 such tasks can hold the resource on the head node at once).
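To make that arithmetic concrete, here is a toy model of how fractional resource requests bound placement and concurrency. This is not Ray's actual scheduler, and the node/task dictionaries are illustrative:

```python
import math

def max_concurrent(node_resources, task_request):
    """How many copies of a task can hold their resources on a node at once."""
    fits = []
    for name, need in task_request.items():
        have = node_resources.get(name, 0.0)
        if need > have:
            return 0  # the node can never run this task
        # Small epsilon guards against float rounding (e.g. 1.0 / 0.01).
        fits.append(int(have / need + 1e-9))
    return min(fits)

head = {"CPU": 4, "data": 1.0}
laptop = {"CPU": 8}  # no "data" resource, so the task never lands here

task = {"data": 0.01}
print(max_concurrent(head, task))    # -> 100
print(max_concurrent(laptop, task))  # -> 0
```

The same logic explains approach 1: advertising num_cpus=0 for the local machine makes every CPU-requiring task ineligible there.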

This approach is preferable if you plan on connecting more nodes to your cluster in the future, where some of them have your data and some don't.