All,
I am trying to train a distributed model using Horovod on Azure Machine Learning Service as shown below.
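from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# Two-node MPI (Horovod) job on the GPU compute target
estimator = TensorFlow(source_directory=script_folder,
                       entry_script='train_script.py',
                       script_params=script_params,
                       compute_target=compute_target_gpu_4,
                       conda_packages=['scikit-learn'],
                       node_count=2,
                       distributed_training=MpiConfiguration(),
                       framework_version='1.13',
                       use_gpu=True)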
run = exp.submit(estimator)
- How do I enable the Horovod timeline?
- How do I enable more detailed MPI tracing to see the communication between the nodes?
Thanks.
The following uses the TensorFlow Estimator class in the SDK, with distributed_training set to Mpi().
Another sample uses Horovod to train a GenSen sentence similarity model: https://github.com/microsoft/nlp-recipes/blob/46c0658b79208763e97ae3171e9728560fe37171/examples/sentence_similarity/gensen_train.py
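On the two questions themselves, here is a minimal sketch rather than a confirmed recipe. Horovod writes a timeline when the HOROVOD_TIMELINE environment variable points to a file path, and HOROVOD_TIMELINE_MARK_CYCLES=1 marks cycle boundaries in it. For communication detail, NCCL_DEBUG=INFO makes NCCL log its setup, and Open MPI reads MCA parameters from OMPI_MCA_* variables. The helper name horovod_debug_env, the ./outputs path, and the verbosity level 30 are my own choices for illustration. I am also assuming that the image uses Open MPI and that variables passed through the estimator's environment_variables argument reach every MPI rank; I have not verified either on Azure ML.

```python
def horovod_debug_env(timeline_path="./outputs/horovod_timeline.json",
                      mark_cycles=False,
                      verbose_comm=False):
    """Build environment variables for a Horovod timeline and, optionally,
    more verbose communication logging. Hypothetical helper, not an SDK API."""
    # Horovod records a Chrome-tracing JSON file at this path (written by rank 0).
    env = {"HOROVOD_TIMELINE": timeline_path}
    if mark_cycles:
        # Adds markers for each Horovod cycle to the timeline.
        env["HOROVOD_TIMELINE_MARK_CYCLES"] = "1"
    if verbose_comm:
        # NCCL prints topology and transport information at INFO level.
        env["NCCL_DEBUG"] = "INFO"
        # Assumption: Open MPI is the MPI implementation in the image;
        # this raises the verbosity of its byte-transfer layer.
        env["OMPI_MCA_btl_base_verbose"] = "30"
    return env


env_vars = horovod_debug_env(mark_cycles=True, verbose_comm=True)
for name, value in sorted(env_vars.items()):
    print(f"{name}={value}")

# Untested usage with the estimator from the question:
# estimator = TensorFlow(..., environment_variables=env_vars)
```

Azure ML uploads files written under ./outputs to the run history, which is why the sketch points the timeline there. The resulting JSON file can be opened in Chrome via chrome://tracing.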