I am checking GPU memory usage during a training step.
To start with the main question: the GPU memory reported by torch.cuda.memory_allocated differs from what nvidia-smi reports, and I want to know why.
I measured the GPU usage with the vgg16 model.
This code prints the theoretical feature map size and weight size:
import torch
import torch.nn as nn
import torchvision.models as models
from functools import reduce

Model_number = 7
Model_name = ["alexnet", "vgg11_bn", "vgg16_bn", "resnet18", "resnet50", "googlenet", "vgg11", "vgg16"]
Model_weights = ["AlexNet_Weights", "VGG11_BN_Weights", "VGG16_BN_Weights", "ResNet18_Weights", "ResNet50_Weights", "GoogLeNet_Weights", "VGG11_Weights", "VGG16_Weights"]
# Look the names up on torchvision.models instead of building import statements with exec().
weights = getattr(models, Model_weights[Model_number]).DEFAULT  # unused below: the model is built with weights=None
model = getattr(models, Model_name[Model_number])(weights=None)

weight_memory_allocate = 0
feature_map_allocate = 0
weight_type = 4  # bytes per element: float32 = 4, half = 2
batch_size = 128
input_channels = 3
input_size = [batch_size, input_channels, 224, 224]

def check_model_info(m):
    global input_size
    global weight_memory_allocate, feature_map_allocate
    if isinstance(m, nn.Conv2d):
        in_channels, out_channels = m.in_channels, m.out_channels
        kernel_size, stride, padding = m.kernel_size[0], m.stride[0], m.padding[0]
        # weight
        weight_memory_allocate += in_channels * out_channels * kernel_size * kernel_size * weight_type
        # bias
        weight_memory_allocate += out_channels * weight_type
        # feature map (the input of this layer)
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type
        out_len = (input_size[2] + 2 * padding - kernel_size) // stride + 1
        input_size = [batch_size, out_channels, out_len, out_len]
    elif isinstance(m, nn.Linear):
        # flatten everything except the batch dimension
        input_size = [batch_size, reduce(lambda a, b: a * b, input_size[1:], 1)]
        in_nodes, out_nodes = m.in_features, m.out_features
        # weight
        weight_memory_allocate += in_nodes * out_nodes * weight_type
        # bias
        weight_memory_allocate += out_nodes * weight_type
        # feature map (the input of this layer)
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type
        input_size = [batch_size, out_nodes]
    elif isinstance(m, nn.MaxPool2d):
        # assumes integer kernel_size, stride and padding, as in torchvision's VGG
        out_len = (input_size[2] + 2 * m.padding - m.kernel_size) // m.stride + 1
        input_size = [batch_size, input_size[1], out_len, out_len]

# Visit every submodule in definition order so the counters accumulate.
model.apply(check_model_info)

print("original memory allocate")
print(f"total = {(weight_memory_allocate + feature_map_allocate)/1024.0/1024.0:.2f}MB")
print(f"weight = {weight_memory_allocate/1024.0/1024.0:.2f}MB")
print(f"feature_map = {feature_map_allocate/1024.0/1024.0:.2f}MB")

original memory allocate
total = 4978.54MB
weight = 527.79MB
feature_map = 4450.75MB
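As a cross-check that does not need torchvision, the same totals can be reproduced by hand. This is a minimal sketch that assumes the standard VGG16 layout (thirteen 3x3 convolutions with padding 1, five 2x2 max-pools, and a 25088-4096-4096-1000 classifier); `VGG16_CFG`, `FC_SIZES` and `estimate_vgg16` are names made up for this example.

```python
# Assumed VGG16 layout: ints are conv output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # assumed classifier widths


def estimate_vgg16(batch_size=128, bytes_per_elem=4, side=224, channels=3):
    """Return (weight_bytes, feature_map_bytes), counting the input of every conv/linear layer."""
    weight_bytes = 0
    feature_bytes = 0
    for item in VGG16_CFG:
        if item == "M":
            side //= 2  # 2x2 pooling with stride 2 halves the spatial size
            continue
        # 3x3 kernel weights plus one bias per output channel
        weight_bytes += (channels * item * 3 * 3 + item) * bytes_per_elem
        # input feature map of this conv layer
        feature_bytes += batch_size * channels * side * side * bytes_per_elem
        channels = item  # padding 1 keeps the spatial size unchanged
    in_features = channels * side * side  # flattened input of the first linear layer
    for out_features in FC_SIZES:
        weight_bytes += (in_features * out_features + out_features) * bytes_per_elem
        feature_bytes += batch_size * in_features * bytes_per_elem
        in_features = out_features
    return weight_bytes, feature_bytes


MB = 1024.0 * 1024.0
weights_b, features_b = estimate_vgg16()
print(f"weight = {weights_b / MB:.2f}MB")
print(f"feature_map = {features_b / MB:.2f}MB")
```

With batch_size=128 and float32 this gives the same 527.79MB and 4450.75MB as the script, so the estimate itself is at least internally consistent.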
And this code checks GPU usage with torch.cuda.memory_allocated:
def test_memory_training(in_size=(3,224,224), out_size=1000, optimizer_type=torch.optim.SGD, batch_size=1, use_amp=False, device=0):
    sample_input = torch.randn(batch_size, *in_size, dtype=torch.float32)
    optimizer = optimizer_type(model.parameters(), lr=.001)
    model.to(device)  # move the parameters to the GPU before measuring
    print(f"After model to device: {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
    for i in range(5):
        print("Iteration", i)
        with torch.cuda.amp.autocast(enabled=use_amp):
            a = torch.cuda.memory_allocated(device)
            out = model(sample_input.to(device)).sum() # Taking the sum here just to get a scalar output
            b = torch.cuda.memory_allocated(device)
        print(f"After forward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        print(f"Memory consumed by forward pass {to_MB(b - a):.2f}MB")
        out.backward()  # releases the saved activations and allocates the gradients
        print(f"After backward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        optimizer.step()
        print(f"After optimizer step {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")

def to_MB(a):
    # bytes -> MB
    return a/1024.0/1024.0

After model to device: 529.04MB
Iteration 0
After forward pass 9481.04MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
Iteration 1
After forward pass 10009.21MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
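
A quick sanity check of the figures above, under the assumption that what remains after the backward pass is the parameters plus one float32 gradient per parameter (plain SGD without momentum keeps no extra optimizer state). The constants are copied from the outputs in this post; the variable names are only for illustration.

```python
# Figures copied from the outputs above, in MB.
after_model_to_device = 529.04   # torch.cuda.memory_allocated after moving the model
theoretical_weights = 527.79     # weight size from the estimate script
after_backward = 1057.21         # torch.cuda.memory_allocated after the backward pass

# Assumption: each parameter gets one gradient tensor of the same size and dtype.
expected_after_backward = after_model_to_device + theoretical_weights
gap = after_backward - expected_after_backward

print(f"expected after backward = {expected_after_backward:.2f}MB")
print(f"unexplained remainder   = {gap:.2f}MB")
```

The remainder is well under 1MB, so the post-backward number looks consistent with "parameters plus gradients" and the interesting discrepancy is in the forward pass.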
This is the output of nvidia-smi during training:
Here's a more detailed question:
I think PyTorch stores the following three things during a training step:
- model parameters
- input feature map in forward pass
- model gradient information for optimizer
In the forward pass, I expect the input feature maps to be stored. In theory that should come to 4450.75MB, but 8952.00MB is actually allocated, which is almost twice as much.
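
One possible accounting for the gap, offered as an unverified assumption rather than a measured breakdown: the estimate only counts the inputs of conv and linear layers, while autograd may also keep each conv output that feeds a max-pool and the int64 index tensor of every max-pool. The sketch below adds those two terms for the assumed standard VGG16 layout; all names in it are hypothetical.

```python
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]  # assumed VGG16 layout
BATCH, SIDE, FLOAT32, INT64 = 128, 224, 4, 8     # batch size, input side, bytes per element
MB = 1024.0 * 1024.0

side, channels = SIDE, 3
pre_pool_bytes = 0    # conv outputs that feed a max-pool (never the input of a conv)
pool_index_bytes = 0  # assumed int64 argmax indices, one per max-pool output element
for item in CFG:
    if item == "M":
        pre_pool_bytes += BATCH * channels * side * side * FLOAT32
        side //= 2  # 2x2 pooling with stride 2
        pool_index_bytes += BATCH * channels * side * side * INT64
    else:
        channels = item  # 3x3 conv with padding 1 keeps the spatial size

counted = 4450.75  # MB, the feature-map estimate from this post
extra = (pre_pool_bytes + pool_index_bytes) / MB
print(f"pre-pool outputs = {pre_pool_bytes / MB:.2f}MB")
print(f"pool indices     = {pool_index_bytes / MB:.2f}MB")
print(f"estimate + extra = {counted + extra:.2f}MB (measured: 8952.00MB)")
```

Under these assumptions the total comes to about 8934MB, close to the measured 8952.00MB. The sketch ignores the classifier outputs, dropout masks and allocator rounding, so it should be read as a plausibility check, not a confirmed explanation.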
Also, nvidia-smi shows about twice as much memory in use as torch.cuda.memory_allocated does.
What causes these differences?
Thanks for reading the long question. Any help is appreciated.
What is displayed in nvidia-smi is probably not the allocated memory, but the reserved memory. You can also read out the reserved memory using
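
A minimal sketch of reading both counters side by side, assuming the call meant here is torch.cuda.memory_reserved (a real PyTorch function; whether it is the one intended above is a guess). `memory_report` is a hypothetical helper name. Reserved memory includes blocks that PyTorch's caching allocator keeps after tensors are freed, so it is never smaller than the allocated figure; nvidia-smi additionally counts the CUDA context itself, so it can read higher still.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def memory_report(device=0):
    """Return allocated and reserved CUDA memory in MB, or None without a usable GPU."""
    if torch is None or not torch.cuda.is_available():
        return None
    mb = 1024.0 * 1024.0
    return {
        # memory occupied by live tensors
        "allocated_MB": torch.cuda.memory_allocated(device) / mb,
        # memory held by the caching allocator, including freed-but-cached blocks
        "reserved_MB": torch.cuda.memory_reserved(device) / mb,
    }


report = memory_report()
if report is None:
    print("CUDA is not available; nothing to report")
else:
    print(f"allocated = {report['allocated_MB']:.2f}MB")
    print(f"reserved  = {report['reserved_MB']:.2f}MB")
```

Calling `memory_report()` right after the forward pass and again after the backward pass would show whether the reserved figure stays near the forward-pass peak while the allocated figure drops.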