I am training the Llama model in a multi-node environment using huggingface/accelerate. When I run it as follows to profile it, the program dies because of a problem with the ssh connection to another node.
$ nsys profile accelerate launch train.py -b 1 -m Llama-2-7b-chat-hf -o sgd -t
I know this is not an accurate way to profile a multi-node job, but I thought profiling would at least work. Instead, I can't connect to the other nodes once I use the nsys command…
Also, if I then drop the nsys command and just run the application, it fails with the same issue. Eventually, I have to stop the docker container and start it again to fix it… What is going on?
The solution to this is to create a bash script and put the nsys command in it. Make sure to use the -o/--output switch and provide a report name using %p, so that reports from different ranks will not collide. Launch the bash script with huggingface using the --no_python option, e.g., accelerate launch --no_python <bash script>. These steps are also described at docs.nvidia.com/nsight-systems/UserGuide/index.html#deepspeed for similar parallel job launchers.
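As a rough sketch of the steps above, the wrapper could look something like the following. The file name nsys_wrapper.sh and the report name report_%h_%p are placeholders chosen for illustration, not anything the answer prescribes. Adding %h (hostname) next to %p (process ID) is an extra assumption, meant to keep reports from different nodes apart if they write to a shared directory. The snippet only writes the wrapper script to disk; it does not start profiling.

```shell
# Write a small wrapper script (hypothetical name: nsys_wrapper.sh).
# The quoted 'EOF' keeps "$@" and %p literal inside the generated file.
cat > nsys_wrapper.sh <<'EOF'
#!/usr/bin/env bash
# Each rank started by the launcher runs its own nsys instance.
# %h expands to the hostname and %p to the process ID, so every
# rank writes a separately named report.
exec nsys profile --output "report_%h_%p" python train.py "$@"
EOF

# Make the wrapper executable so the launcher can run it directly.
chmod +x nsys_wrapper.sh
```

With the training arguments from the question, the launch would then be along the lines of: accelerate launch --no_python ./nsys_wrapper.sh -b 1 -m Llama-2-7b-chat-hf -o sgd -t. Here the arguments after the script name are forwarded to train.py through "$@", so the -o sgd option still belongs to train.py and not to nsys.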