I'm new to distributed computing, and I need to train multiple machine learning models on a supercomputer. I need to run the same training script several times, passing a different command-line argument to each run. Can I achieve this with mpiexec, so that I train multiple models in parallel, each with a different input?
I found the Single Program Multiple Data (SPMD) model of MPI, but I don't know the corresponding commands.
I want to run the following line in parallel across the compute nodes of the cluster:
python train.py arg > log.out # arg is the argument that differs for each node
But if I use:
mpiexec python train.py arg > log.out
it just runs train.py multiple times in parallel with the same command-line argument, arg.
Can someone point out the right way to do it? Thank you!
One way to achieve what you want is to create a top-level script,
mpi_train.py
using mpi4py. In an MPI job, every process runs the same code, but each process has a unique rank, so the rank can be used to make each process behave differently.
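A minimal sketch of such a script (the ImportError fallback is only there so the snippet also runs where mpi4py isn't installed):

```python
# mpi_train.py -- minimal sketch: every process runs this same file,
# but each one sees a different rank
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
except ImportError:
    rank, size = 0, 1  # single-process fallback if mpi4py is unavailable

print(f"Hello from rank {rank} of {size}")
```

Running it with `mpiexec -n 4 python mpi_train.py` prints one "Hello" line per rank (ranks 0 through 3), in no guaranteed order.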
The different ranks can then be used to read per-process arguments from a separate file, e.g. one argument per line, with rank i taking line i.
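For example (a sketch: the file names args.txt and train.py, and launching the training script via subprocess, are assumptions; toy versions of both files are written inline only so the snippet runs as-is):

```python
# mpi_train.py -- sketch: each rank reads its own argument and runs train.py
import subprocess
import sys
from pathlib import Path

try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
    rank = 0  # single-process fallback if mpi4py is unavailable

# Toy stand-ins so the sketch is runnable; in practice these already exist.
Path("args.txt").write_text("0.001\n0.01\n0.1\n1.0\n")
Path("train.py").write_text("import sys; print('training with', sys.argv[1])\n")

# One argument per line; rank i takes line i.
args = Path("args.txt").read_text().split()
my_arg = args[rank]

# Each rank launches the unmodified training script with its own argument.
result = subprocess.run(
    [sys.executable, "train.py", my_arg],
    capture_output=True, text=True, check=True,
)
print(f"rank {rank}: {result.stdout.strip()}")
```

With `mpiexec -n 4 python mpi_train.py`, rank 0 would train with 0.001, rank 1 with 0.01, and so on, with no change to train.py itself.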
Note that you should explicitly redirect the output of each process to its own file; otherwise all prints go to the same stdout and get jumbled, since the ordering of the processes is not guaranteed.