PyTorch mixed precision: torch.cuda.amp inference running slower than normal FP32


I am trying to run inference with a standard resnet18 model from torchvision.models. The model was trained normally, purely in FP32, without any mixed precision. However, I want faster inference, so I enabled torch.cuda.amp.autocast() only while running a test inference case.

The code is given below -

model = torchvision.models.resnet18()
model = model.to(device) # Pushing to GPU

# Train the model normally

Without amp -

tensor = torch.rand(1,3,32,32).to(device) # Random tensor for testing
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor) # warmup
  model(tensor) # warmup
  start.record()
  for i in range(20): # total time over 20 iterations 
    model(tensor)
  end.record()
  torch.cuda.synchronize()
    
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))

  execution time in milliseconds: 5.264944076538086

With amp -

tensor = torch.rand(1,3,32,32).to(device)
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor)
  model(tensor)

  start.record()
  with torch.cuda.amp.autocast(): # autocast initialized
    for i in range(20):
      model(tensor)
  end.record()
  torch.cuda.synchronize()
  
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))

  execution time in milliseconds: 10.619884490966797

Clearly, the autocast()-enabled code takes about twice as long. Even with larger models like resnet50, the difference stays roughly the same.

Can someone help me out with this? I am running this example on Google Colab; the GPU specifications are below.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
torch.version.cuda == 10.1
torch.__version__  == 1.8.1+cu101

1 Answer

Answered by Mercury

It's most likely because of the GPU you're using: a P100, which has 3584 CUDA cores but zero tensor cores, and the latter typically play the main role in the mixed precision speedup. You may want to take a quick look at the "Hardware Comparison" section of this article.
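As a quick sanity check (my addition, not part of the original answer), you can query the compute capability of the GPU you've been assigned: tensor cores were introduced with Volta, i.e. compute capability 7.0 and above, while the P100 reports 6.0 -

import torch

# Rough check: tensor cores require compute capability >= 7.0 (Volta or newer);
# the P100 reports 6.0, so FP16 autocast mostly adds casting overhead there.
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print('{}: compute capability {}.{}'.format(name, major, minor))
print('Tensor cores available:', (major, minor) >= (7, 0))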

If you're stuck using Colab, the only way I can foresee a speedup is if you get assigned a T4, which does have tensor cores.

Furthermore, it seems like you're using a single image, i.e. a batch size of 1. If you do get a T4, try re-running your benchmarks with larger batch sizes as well, such as 32, 64, 128, or 256 (see the sketch below). You should see much more visible improvements once you parallelize over batches.
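A minimal sketch of such a comparison, reusing your event-based timing (the batch sizes here are just illustrative, not a recommendation) -

import torch
import torchvision

device = torch.device('cuda')
model = torchvision.models.resnet18().to(device).eval()

def benchmark(batch_size, use_amp, iters=20):
    # Average forward-pass time in milliseconds over `iters` runs, timed with CUDA events.
    tensor = torch.rand(batch_size, 3, 32, 32, device=device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        model(tensor)  # warmup
        model(tensor)  # warmup
        torch.cuda.synchronize()
        start.record()
        with torch.cuda.amp.autocast(enabled=use_amp):
            for _ in range(iters):
                model(tensor)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for bs in (1, 32, 64, 128, 256):
    fp32_ms = benchmark(bs, use_amp=False)
    amp_ms = benchmark(bs, use_amp=True)
    print('batch {}: fp32 {:.2f} ms, amp {:.2f} ms'.format(bs, fp32_ms, amp_ms))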