Does using FP16 help accelerate generation? (HuggingFace BART)


I followed the guide below to use FP16 in PyTorch: https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/

Basically, I'm using BART from HuggingFace for generation.

  1. During the training phase, I get a 2x speedup and lower GPU memory consumption (a sketch of my setup follows below).
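
For reference, the training loop follows the mixed-precision pattern from the linked guide. A minimal sketch, assuming model, dataloader, and optimizer are already set up on a CUDA device:

    import torch

    scaler = torch.cuda.amp.GradScaler()

    for batch in dataloader:
        optimizer.zero_grad()
        # Forward pass runs in mixed precision under autocast
        with torch.cuda.amp.autocast():
            outputs = model(**batch)
            loss = outputs.loss
        # Scale the loss before backward, then step and update the scaler
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()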

But:

  1. I found there is no speedup when I call model.generate under torch.cuda.amp.autocast() (a timing sketch follows this list):

         with torch.cuda.amp.autocast():
             model.generate(...)

  2. When I save the model with:

         model.save_pretrained("model_folder")

     the size does not decrease by half; I have to call model.half() before saving to get a half-size model (see the sketch after this list).
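
For 1., a minimal sketch of how the generation time can be compared with and without autocast; facebook/bart-base is just an example checkpoint, and a CUDA device is assumed:

    import time
    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").cuda()
    inputs = tokenizer("UN Chief says there is no military solution in Syria",
                       return_tensors="pt").to("cuda")

    def timed_generate():
        # Synchronize so the timing covers the actual GPU work
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, max_length=50)
        torch.cuda.synchronize()
        return time.perf_counter() - start

    print("fp32 generate:    ", timed_generate())
    with torch.cuda.amp.autocast():
        print("autocast generate:", timed_generate())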
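
For 2., the save path with the explicit half() call; a minimal sketch, assuming the same model object (save_pretrained writes the weights in whatever dtype they currently have):

    import os

    # Convert all weights to FP16 in place, then serialize them
    model.half()
    model.save_pretrained("model_folder")

    # Rough on-disk size check of the saved checkpoint
    size_mb = sum(
        os.path.getsize(os.path.join("model_folder", f))
        for f in os.listdir("model_folder")
    ) / 1e6
    print(f"saved size: {size_mb:.1f} MB")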

Thus, my questions:

  • Is the issue in 1. expected, or did I do something wrong?
  • Is what I did in 2. proper?