When I train a 7B model on 4 * 8 GPUs with a global batch size of 4096 (per_device_train_batch_size is 8 and gradient_accumulation_steps is 16), the loss decreases normally over the first ten thousand steps.
However, when I train the same model on 8 * 8 GPUs with the same data, the same random seed, and the same global batch size (per_device_train_batch_size is 8 and gradient_accumulation_steps is 8), the loss is always slightly higher than in the 32-GPU run, and it skyrockets after about 2000 steps. I expected that with the same random seed and the same global batch size, the two training processes would be equivalent. I am using transformers==4.34.1 and deepspeed==0.9.4.
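For reference, here is a minimal sketch of how the two runs are configured (the output directories, seed value, and DeepSpeed config path below are placeholders, not my exact settings); both runs end up with the same global batch size of 4096:

```python
# Sketch of the two training configurations (placeholder paths/seed).
# Global batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
from transformers import TrainingArguments

# 4 * 8 = 32 GPUs: 32 * 8 * 16 = 4096
args_32gpu = TrainingArguments(
    output_dir="out_32gpu",             # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    seed=42,                            # same seed in both runs
    deepspeed="ds_config.json",         # placeholder DeepSpeed config path
)

# 8 * 8 = 64 GPUs: 64 * 8 * 8 = 4096 (same global batch size)
args_64gpu = TrainingArguments(
    output_dir="out_64gpu",             # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    seed=42,
    deepspeed="ds_config.json",
)
```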
Which parameters should I change when training on 8 * 8 GPUs, compared to 4 * 8 GPUs, to keep the result unchanged?