I followed the guide below to use FP16 in PyTorch: https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
Basically, I'm using BART from HuggingFace for generation.
- During the training phase, I'm able to get a 2x speedup and lower GPU memory consumption.
However:
- I found out there is no speedup when I call `model.generate` under `torch.cuda.amp.autocast()`:

```python
with torch.cuda.amp.autocast():
    model.generate(...)
```
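For context, my understanding (an assumption on my part, not something the guide states) is that autocast only casts the inputs/outputs of eligible ops on the fly, while the module's stored weights stay float32. A minimal CPU-only sketch of that behavior, using bfloat16 because CPU autocast does not use float16:

```python
import torch

lin = torch.nn.Linear(4, 4)  # parameters are float32 by default
x = torch.randn(2, 4)

# autocast runs eligible ops in reduced precision on the fly;
# on CPU the autocast dtype is bfloat16 (on CUDA it would be float16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = lin(x)

print(y.dtype)           # output computed in reduced precision
print(lin.weight.dtype)  # stored weights remain float32
```

This is why saving right after an autocast region still writes full-precision weights: autocast never converts the parameters in place.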
- When I save the model with `model.save_pretrained("model_folder")`, the checkpoint size does not shrink to half; I have to call `model.half()` before saving in order to get a half-size model.
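The size halving itself is just storage precision: 16-bit weights take half the bytes of 32-bit ones. A toy numpy sketch (not BART-specific; the 1000-parameter tensor is hypothetical):

```python
import numpy as np

# a hypothetical weight tensor with 1000 parameters
w32 = np.ones(1000, dtype=np.float32)
w16 = w32.astype(np.float16)  # analogous to what model.half() does per parameter

print(w32.nbytes)  # 4000 bytes: 4 bytes per float32 parameter
print(w16.nbytes)  # 2000 bytes: 2 bytes per float16 parameter
```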
Thus, my questions:
- Is the first issue (no speedup for `model.generate` under autocast) expected, or did I do something wrong?
- Is the second operation (calling `model.half()` before saving) the proper way to do it?