I followed the guide below to use FP16 in PyTorch: https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
Basically, I'm using BART from HuggingFace for generation.
- During the training phase, I'm able to get a 2x speedup and lower GPU memory consumption.
However:
- I found out there is no speedup when I call `model.generate` under `torch.cuda.amp.autocast()`:

```python
with torch.cuda.amp.autocast():
    model.generate(...)
```
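For context, my understanding (an assumption on my part, not something the guide states) is that autocast only casts the inputs/outputs of eligible ops on the fly, while the module's stored weights stay float32. A minimal CPU-only sketch of that behavior, using bfloat16 because CPU autocast does not use float16:

```python
import torch

lin = torch.nn.Linear(4, 4)  # parameters are float32 by default
x = torch.randn(2, 4)

# autocast runs eligible ops in reduced precision on the fly;
# on CPU the autocast dtype is bfloat16 (on CUDA it would be float16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = lin(x)

print(y.dtype)           # output computed in reduced precision
print(lin.weight.dtype)  # stored weights remain float32
```

This is why saving right after an autocast region still writes full-precision weights: autocast never converts the parameters in place.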
- When I save the model with `model.save_pretrained("model_folder")`, the checkpoint size does not shrink to half; I have to call `model.half()` before saving in order to get a half-size model.
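The size halving itself is just storage precision: 16-bit weights take half the bytes of 32-bit ones. A toy numpy sketch (not BART-specific; the 1000-parameter tensor is hypothetical):

```python
import numpy as np

# a hypothetical weight tensor with 1000 parameters
w32 = np.ones(1000, dtype=np.float32)
w16 = w32.astype(np.float16)  # analogous to what model.half() does per parameter

print(w32.nbytes)  # 4000 bytes: 4 bytes per float32 parameter
print(w16.nbytes)  # 2000 bytes: 2 bytes per float16 parameter
```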
Thus, my questions:
- Is the first issue (no speedup for `model.generate` under autocast) expected, or did I do something wrong?
- Is the second operation (calling `model.half()` before saving) the proper way to do it?