When I train a 7B model on 4 * 8 GPUs with a global batch size of 4096 (per_device_train_batch_size is 8 and gradient_accumulation_steps is 16), the loss decreases normally over the first ten thousand steps.
However, when I train the same model on 8 * 8 GPUs with the same data, the same random seed, and the same global batch size (per_device_train_batch_size is 8 and gradient_accumulation_steps is 8), the loss is always slightly higher than in the 32-GPU run, and it skyrockets after about 2000 steps. I expected that with the same random seed and the same global batch size, the two training processes would be equivalent. I am using transformers==4.34.1 and deepspeed==0.9.4.
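For reference, here is a minimal sketch of how the two runs are configured (the output directories, seed value, and DeepSpeed config path below are placeholders, not my exact settings); both runs end up with the same global batch size of 4096:

```python
# Sketch of the two training configurations (placeholder paths/seed).
# Global batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
from transformers import TrainingArguments

# 4 * 8 = 32 GPUs: 32 * 8 * 16 = 4096
args_32gpu = TrainingArguments(
    output_dir="out_32gpu",             # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    seed=42,                            # same seed in both runs
    deepspeed="ds_config.json",         # placeholder DeepSpeed config path
)

# 8 * 8 = 64 GPUs: 64 * 8 * 8 = 4096 (same global batch size)
args_64gpu = TrainingArguments(
    output_dir="out_64gpu",             # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    seed=42,
    deepspeed="ds_config.json",
)
```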
Which parameters should I change when training on 8 * 8 GPUs, compared to 4 * 8 GPUs, to keep the result unchanged?