I am trying to run inference with a standard resnet18 model from torchvision.models. The model is trained without any mixed-precision learning, purely in FP32.
However, I want faster inference, so I enabled torch.cuda.amp.autocast() only while running a test inference case.
The code for this is given below -
import torch
import torchvision

device = torch.device('cuda')
model = torchvision.models.resnet18().to(device)  # Pushing to GPU
# Train the model normally
Without amp -
tensor = torch.rand(1, 3, 32, 32).to(device)  # Random tensor for testing
with torch.no_grad():
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model(tensor)  # warmup
    model(tensor)  # warmup
    start.record()
    for i in range(20):  # total time over 20 iterations
        model(tensor)
    end.record()
    torch.cuda.synchronize()
    print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))
execution time in milliseconds: 5.264944076538086
With amp -
tensor = torch.rand(1, 3, 32, 32).to(device)
with torch.no_grad():
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model(tensor)
    model(tensor)
    start.record()
    with torch.cuda.amp.autocast():  # autocast initialized
        for i in range(20):
            model(tensor)
    end.record()
    torch.cuda.synchronize()
    print('execution time in milliseconds: {}'.format(start.elapsed_time(end) / 20))
execution time in milliseconds: 10.619884490966797
Clearly, the autocast()-enabled code is taking roughly double the time. Even with larger models like resnet50, the timing difference is about the same.
Can someone help me out with this? I am running this example on Google Colab, and below are the specifications of the GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P0 28W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
torch.version.cuda == 10.1
torch.__version__ == 1.8.1+cu101
It's most likely because of the GPU you're using: a P100, which has 3584 CUDA cores but no tensor cores, and tensor cores are what typically provide the mixed-precision speedup. You may want to take a quick look at the "Hardware Comparison" section of this article.
If you're stuck with Colab, the only way I can foresee a possible speedup is getting assigned a T4, which does have tensor cores.
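As a quick sanity check, a minimal sketch like the one below (using standard torch.cuda calls, assuming the runtime has a CUDA GPU attached) reports which GPU you were assigned and whether its compute capability is 7.0 or higher, which is when tensor cores become available:

import torch

# Report the assigned GPU and its compute capability
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print('GPU: {}, compute capability: {}.{}'.format(name, major, minor))

# Tensor cores exist on compute capability 7.0+ (e.g. V100, T4, A100);
# a P100 reports 6.0, so autocast cannot use tensor cores there.
print('Has tensor cores:', (major, minor) >= (7, 0))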
Furthermore, it seems like you're using only a single image, i.e. a batch size of 1. If you do get a T4, try re-running your benchmarks with larger batch sizes as well, e.g. 32, 64, 128, or 256. You should see much more visible improvements once you parallelize over batches; a sketch of such a run follows.
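For reference, here is a minimal sketch of that kind of batched benchmark. It keeps your timing structure but assumes a batch size of 128 (an arbitrary choice for illustration) and a freshly constructed resnet18 in eval mode on the GPU:

import torch
import torchvision

device = torch.device('cuda')
model = torchvision.models.resnet18().to(device).eval()

batch = torch.rand(128, 3, 32, 32, device=device)  # assumed batch size of 128
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    model(batch)  # warmup
    model(batch)  # warmup
    start.record()
    with torch.cuda.amp.autocast():
        for _ in range(20):  # average over 20 iterations
            model(batch)
    end.record()
    torch.cuda.synchronize()
    print('ms per iteration: {}'.format(start.elapsed_time(end) / 20))

On a GPU with tensor cores, the autocast version of this batched run should come out noticeably faster than the FP32 one; on the P100 it likely still won't.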