Insufficient cluster resources to launch trial - has only 0 GPUs


I am following this tutorial (which is basically this) in order to use Ray Tune for hyperparameter optimization. My model trains fine on the GPU without the optimization, but now I want to optimize.

I applied the tutorial to my code, but when I try to kick off the run:

result = tune.run(
    train,
    resources_per_trial={"gpu": 1},
    config=config,
    num_samples=10,
    scheduler=scheduler,
    progress_reporter=reporter,
    checkpoint_at_end=False,
)

I'm stuck with:

TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 1 GPUs, but the cluster has only 6 CPUs, 0 GPUs, 12.74 GiB heap, 4.39 GiB objects (1.0 node:XXX).
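For what it's worth, the feasibility check behind this error is simple: every resource a trial requests must fit within the cluster's reported totals, and here the reported GPU total is 0. A toy sketch of that check (illustrative only, not Ray's actual code):

```python
def can_launch(trial_request, cluster_resources):
    """Toy feasibility check: a trial can launch only if every
    requested resource fits within the cluster's reported totals."""
    return all(
        cluster_resources.get(resource, 0) >= amount
        for resource, amount in trial_request.items()
    )

# The situation from the error message: 1 CPU + 1 GPU requested,
# but the cluster reports 6 CPUs and 0 GPUs.
print(can_launch({"CPU": 1, "GPU": 1}, {"CPU": 6, "GPU": 0}))  # False
```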

But then again, when I take a look at the Ray dashboard, both GPUs are clearly listed there.

Why isn't ray tune seeing my GPUs? How do I make this work?
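One thing worth ruling out: Ray derives its GPU count from the environment, notably CUDA_VISIBLE_DEVICES, so a variable that is set but empty makes Ray report 0 GPUs even while nvidia-smi and the dashboard show the cards. A rough sketch of that masking logic (my approximation, not Ray's implementation):

```python
import os

def visible_gpu_count(env=None):
    """Approximate how many GPUs a CUDA process (and hence Ray) can see
    from CUDA_VISIBLE_DEVICES. Returns None when the variable is unset,
    in which case autodetection of all physical GPUs applies."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return None  # unset: all GPUs visible via autodetection
    ids = [d for d in visible.split(",") if d.strip()]
    return len(ids)  # set-but-empty masks every GPU (count 0)

print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": ""}))     # 0
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))  # 2
```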

Specs:

GPU 0: TITAN Xp
GPU 1: GeForce GTX 1080 Ti
CUDA 10.1
Python 3.7
PyTorch 1.7
Debian 9.12
ray tune 1.0.1.post1

Edit:

ray.init(num_gpus=1)
ray.get_gpu_ids()

[]
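A note on this check: as far as I understand it, ray.get_gpu_ids() reports the GPUs assigned to the calling worker process, and the driver process is assigned none, so an empty list in the driver does not by itself prove the cluster has no GPUs. A toy illustration of that per-worker bookkeeping (hypothetical names, not Ray internals):

```python
def assign_gpus(total_gpus, worker_requests):
    """Toy per-worker GPU bookkeeping: ids are handed out only to
    workers that request them; the driver process holds none, so a
    get_gpu_ids()-style call there returns an empty list."""
    free = list(range(total_gpus))
    assignments = {"driver": []}  # the driver never holds GPU ids
    for worker, n in worker_requests.items():
        assignments[worker] = [free.pop(0) for _ in range(n)]
    return assignments

print(assign_gpus(2, {"trial_0": 1, "trial_1": 1}))
# {'driver': [], 'trial_0': [0], 'trial_1': [1]}
```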


There is 1 answer

Answered by Asimandia:

I would suggest checking the placement_strategy and max_t parameters. The first can cause freezes depending on your system specification, and the second can exceed the total time allowed for the computation.
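To illustrate the max_t point: with an ASHA/HyperBand-style scheduler, max_t caps how long any trial may run, and the promotion rungs are derived from it, so a mismatched max_t can stop trials almost immediately. A toy computation of such rung milestones (my sketch of the general scheme, not Ray's exact code; parameter names mirror Tune's ASHAScheduler):

```python
def asha_milestones(max_t, grace_period=1, reduction_factor=4):
    """Sketch of ASHA-style rung milestones: trials are considered for
    promotion at grace_period * reduction_factor**k and hard-stopped
    at max_t. A too-small max_t ends every trial almost immediately."""
    milestones, t = [], grace_period
    while t < max_t:
        milestones.append(t)
        t *= reduction_factor
    milestones.append(max_t)  # no trial runs past max_t
    return milestones

print(asha_milestones(100))  # [1, 4, 16, 64, 100]
print(asha_milestones(2))    # [1, 2] - trials barely get to run
```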