On this page, https://huggingface.co/docs/accelerate/v0.24.0/en/concept_guides/performance#learning-rates, we can see the following:
Learning Rates
As noted in multiple sources[1][2], the learning rate should be scaled linearly based on the number of devices present. The below snippet shows doing so with Accelerate:
Since users can have their own learning rate schedulers defined, we leave this up to the user to decide if they wish to scale their learning rate or not.
learning_rate = 1e-3
accelerator = Accelerator()
learning_rate *= accelerator.num_processes
optimizer = AdamW(params=model.parameters(), lr=learning_rate)
You will also find that accelerate will step the learning rate based on the number of processes being trained on. This is because of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as a single GPU to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).
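If I expand that snippet into a full toy example, my understanding of the intended pattern is roughly this (my own sketch with a dummy model, data, and scheduler, not code taken from the docs):

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# dummy model and data just to make the sketch self-contained
model = torch.nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 2))
train_loader = DataLoader(dataset, batch_size=8)

base_lr = 1e-3
learning_rate = base_lr * accelerator.num_processes  # linear scaling, as in the docs snippet

optimizer = AdamW(model.parameters(), lr=learning_rate)
scheduler = LinearLR(optimizer, total_iters=len(train_loader))

# prepare() also wraps the scheduler; as I read the docs, the wrapped
# scheduler is then stepped once per process per optimizer step, which is
# the "stepped twice as often" behaviour described for 2 GPUs
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)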
When I read the documentation above, my understanding is: the effective batch size increases linearly with the number of devices. With more devices, the total batch size seen by the model at each optimization step is larger, and a larger batch size can require a higher learning rate for efficient training.
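To make that concrete with example numbers (not my real config):

per_device_batch_size = 8   # example value only
num_processes = 2           # e.g. 2 GPUs
base_lr = 1e-3

effective_batch_size = per_device_batch_size * num_processes  # 16 samples per optimizer step
scaled_lr = base_lr * num_processes                           # 2e-3

print(effective_batch_size, scaled_lr)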
But I use Accelerate together with DeepSpeed, and my code looks like this:

from deepspeed.ops.adam import DeepSpeedCPUAdam

lr = 5e-5
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr * accelerator.num_processes)
In the logs I see:
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000350, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(the two lines above are repeated 7 times in the log)
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000350, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(the two lines above are repeated 5 times in the log)
I think the log output shows that my alpha (lr) is 5e-05 * 7 = 3.5e-4, so I am not sure whether this is right.
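To double-check, I am thinking of printing the learning rate each rank actually holds, reusing the optimizer and accelerator from my snippet above (just a sketch; I assume DeepSpeedCPUAdam exposes the standard param_groups attribute like other torch optimizers):

# sketch: print the lr each process ends up with, reusing `optimizer`
# and `accelerator` from my code above
for group in optimizer.param_groups:
    print(f"rank {accelerator.process_index}: lr = {group['lr']}")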
Should the learning rate on each GPU be 5e-5, or 5e-5 * num_processes?
Does the answer depend on whether I use Accelerate with DeepSpeed or Accelerate alone?
That's all. Thank you for reading this, and I would be grateful if you could clear up my doubts.