Problems when profiling LLM training using "huggingface/accelerate" with Nsight Systems

Question

164 views Asked by 상현박 At 20 February 2024 at 11:28

I am training a Llama model in a multi-node environment using huggingface/accelerate. When I launch it as follows to profile it, the program dies because of a problem with the ssh connection to another node.

$ nsys profile accelerate launch train.py -b 1 -m Llama-2-7b-chat-hf -o sgd -t

I know this is not an accurate way to profile a multi-node job, but I expected profiling to at least run. Instead, the node cannot connect to the other nodes because I used the nsys command…

Also, after that, if I run the application without the nsys command, it fails with the same issue. In the end I have to stop the Docker container and start it again to fix it… What is going on?

There is 1 answer

**Zois Tasoulas** · Accepted Answer · 2024-02-26T05:10:48+00:00

The solution is to create a bash script and put the nsys command in it. Make sure to use the -o/--output switch and include %p in the report name, so that reports from different ranks do not collide.

Launch the bash script with accelerate using the --no_python option, e.g., accelerate launch --no_python <bash script>.

These steps are also described at docs.nvidia.com/nsight-systems/UserGuide/index.html#deepspeed for similar parallel job launchers.
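
As a minimal sketch of the steps above, the wrapper could look roughly like this. The script name profile_rank.sh and the report prefix llama_rank are hypothetical, and the training arguments are copied from the question as-is; how train.py is invoked inside the wrapper (here via python) is an assumption that may need adjusting for your setup.

```shell
# Write the per-rank wrapper script (profile_rank.sh is a hypothetical name).
cat > profile_rank.sh <<'EOF'
#!/usr/bin/env bash
# Each launched process runs its own nsys instance.
# %p in the report name is expanded by nsys to a process ID,
# so the reports from different ranks get distinct file names.
# Note: the first -o belongs to nsys; the second -o is train.py's own option.
exec nsys profile -o "llama_rank_%p" \
    python train.py -b 1 -m Llama-2-7b-chat-hf -o sgd -t
EOF
chmod +x profile_rank.sh

# Then let accelerate start the wrapper instead of wrapping accelerate in nsys:
# accelerate launch --no_python ./profile_rank.sh
```

The point of this arrangement is that nsys wraps each training process rather than the launcher itself, so the launcher's own node-to-node connections are left outside the profiler.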
