A simple distributed training python program for deep learning models by Horovod on GPU cluster

285 views Asked by user3448011 At 11 July 2020 at 21:15

I am trying to run the Python 3 example code from https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html on a Databricks GPU cluster (1 driver and 2 workers).

Databricks environment:

 ML 6.6, Scala 2.11, Spark 2.4.5, GPU

It is for distributed deep learning model training.

I just tried a very simple example at first:

 from sparkdl import HorovodRunner
 hr = HorovodRunner(np=2)

 def train():
   print('in train')
   import tensorflow as tf
   print('after import tf')
   hvd.init()
   print('done')

 hr.run(train)

But the command keeps running without making any progress. The cell output shows:

HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.

### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.

Am I missing something, or do I need to set something up to make it work?

Thanks

There is 1 answer

**Erik** · Answer 1 · 2022-04-21T13:20:54+00:00

Your code does no actual training. You might have better luck with a better example, such as:

https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
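
One more detail from the question's log may be worth checking: it lists `hvd` among the global names read by the pickled function, yet the posted snippet never imports Horovod. The sketch below uses only the standard library to show why `hvd` appears as a global in the posted function and not in a variant that imports it inside `train`. The helper `global_names` and both function names are hypothetical, this is not how `sparkdl` computes its own message, and `horovod.tensorflow` is an assumed choice of Horovod flavor. Neither function is executed here, so TensorFlow and Horovod do not need to be installed to run it.

```python
import dis


def global_names(fn):
    """Return the names that fn's bytecode looks up as globals (LOAD_GLOBAL)."""
    return {
        ins.argval
        for ins in dis.get_instructions(fn)
        if ins.opname == "LOAD_GLOBAL"
    }


def train_as_posted():
    # Same body as the question's train(): hvd is used but never imported,
    # so Python treats it as a global (or builtin) lookup at call time.
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')


def train_with_import():
    print('in train')
    import tensorflow as tf
    # Assumption: the TensorFlow flavor of Horovod is the one intended.
    import horovod.tensorflow as hvd
    print('after import tf')
    hvd.init()  # hvd is now a local name bound by the import above
    print('done')


print(sorted(global_names(train_as_posted)))    # ['hvd', 'print']
print(sorted(global_names(train_with_import)))  # ['print']
```

If the missing import is indeed a problem on your cluster, passing `train_with_import` to `hr.run(...)` in place of `train` would be the change to try. Whether that resolves the hang is not confirmed anywhere in this thread.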
