I am trying to run inference with a standard resnet18 model from torchvision.models. The model is trained normally without any mixed precision, purely in FP32. However, I want faster results at inference time, so I enabled torch.cuda.amp.autocast() only while running a test inference case.
The code for the same is given below -
import torch
import torchvision

device = torch.device('cuda')
model = torchvision.models.resnet18().to(device)  # Push the model to the GPU
# Train the model normally
Without amp:
tensor = torch.rand(1, 3, 32, 32).to(device)  # Random tensor for testing
with torch.no_grad():
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model(tensor)  # warmup
    model(tensor)  # warmup
    start.record()
    for i in range(20):  # total time over 20 iterations
        model(tensor)
    end.record()
    torch.cuda.synchronize()
    print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))
execution time in milliseconds: 5.264944076538086
With amp:
tensor = torch.rand(1, 3, 32, 32).to(device)  # Random tensor for testing
with torch.no_grad():
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model(tensor)  # warmup
    model(tensor)  # warmup
    start.record()
    with torch.cuda.amp.autocast():  # autocast enabled only here
        for i in range(20):  # total time over 20 iterations
            model(tensor)
    end.record()
    torch.cuda.synchronize()
    print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))
execution time in milliseconds: 10.619884490966797
Clearly, the autocast() enabled code is taking roughly double the time. Even with larger models like resnet50, the timing gap is about the same.
Can someone help me out with this? I am running this example on Google Colab, and below are the specifications of the GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P0 28W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
torch.version.cuda == 10.1
torch.__version__ == 1.8.1+cu101
It's most likely because of the GPU you're using - a P100, which has 3584 CUDA cores but 0 tensor cores, and the latter typically play the main role in mixed precision speedups. You may want to take a quick look at the "Hardware Comparison" section of this article.
If you're stuck with Colab, the only way I can foresee a possible speedup is if you get assigned a T4, which does have tensor cores.
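If you want to check programmatically whether the GPU you've been assigned has tensor cores, one rough sketch is to look at its compute capability with torch.cuda.get_device_capability() (tensor cores arrived with Volta, compute capability 7.0; the threshold check below is my own heuristic, not an official PyTorch API for tensor cores):

import torch

# Tensor cores first appeared with Volta (compute capability 7.0);
# the P100 is Pascal (6.0) and the T4 is Turing (7.5).
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 7:
    print('{} (compute capability {}.{}) has tensor cores'.format(name, major, minor))
else:
    print('{} (compute capability {}.{}) has no tensor cores, '
          'so autocast is unlikely to help'.format(name, major, minor))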
Furthermore, it seems like you're using only a single image / a batch size of 1. If you do get a T4, try re-running your benchmarks with larger batch sizes as well, e.g. 32, 64, 128, 256 and so on (see the sketch below). You should see much more visible improvements when you parallelize over batches.
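As a rough sketch of what that comparison could look like (this reuses the model, device and CUDA-event timing from your snippets; benchmark is just a hypothetical helper name, not a library function):

def benchmark(model, batch_size, use_amp, iters=20):
    # Times `iters` forward passes and returns the average in milliseconds.
    tensor = torch.rand(batch_size, 3, 32, 32).to(device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        model.eval()
        with torch.cuda.amp.autocast(enabled=use_amp):
            model(tensor)  # warmup
            model(tensor)  # warmup
            start.record()
            for _ in range(iters):
                model(tensor)
            end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for batch_size in (32, 64, 128, 256):
    fp32 = benchmark(model, batch_size, use_amp=False)
    amp = benchmark(model, batch_size, use_amp=True)
    print('batch {}: fp32 {:.2f} ms, amp {:.2f} ms'.format(batch_size, fp32, amp))

On a GPU with tensor cores you'd expect the amp column to pull ahead as the batch size grows; on the P100 it will likely stay slower, for the reason above.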