I am checking GPU memory usage during the training step. To start with the main question: the memory reported by torch.cuda.memory_allocated is different from what nvidia-smi reports, and I want to know why.
Actually, I measured the GPU usage using the vgg16 model. This code prints the theoretical feature-map size and weight size:
import torch
import torch.nn as nn
from functools import reduce

Model_number = 7
Model_name = ["alexnet", "vgg11_bn", "vgg16_bn", "resnet18", "resnet50", "googlenet", "vgg11", "vgg16"]
Model_weights = ["AlexNet_Weights", "VGG11_BN_Weights", "VGG16_BN_Weights", "ResNet18_Weights", "ResNet50_Weights", "GoogLeNet_Weights", "VGG11_Weights", "VGG16_Weights"]

exec(f"from torchvision.models import {Model_name[Model_number]}, {Model_weights[Model_number]}")
exec(f"weights = {Model_weights[Model_number]}.DEFAULT")
exec(f"model = {Model_name[Model_number]}(weights=None)")

weight_memory_allocate = 0
feature_map_allocate = 0
weight_type = 4  # float32 = 4 bytes, half = 2 bytes
batch_size = 128
input_channels = 3
input_size = [batch_size, 3, 224, 224]

def check_model_info(m):
    global input_size
    global weight_memory_allocate, feature_map_allocate
    if isinstance(m, nn.Conv2d):
        in_channels, out_channels = m.in_channels, m.out_channels
        kernel_size, stride, padding = m.kernel_size[0], m.stride[0], m.padding[0]
        # weight
        weight_memory_allocate += in_channels * out_channels * kernel_size * kernel_size * weight_type
        # bias
        weight_memory_allocate += out_channels * weight_type
        # feature map (input to this layer)
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type
        out_len = int((input_size[2] + 2 * padding - kernel_size) / stride + 1)
        input_size = [batch_size, out_channels, out_len, out_len]
    elif isinstance(m, nn.Linear):
        input_size = [batch_size, reduce(lambda a, b: a * b, input_size[1:], 1)]
        in_nodes, out_nodes = m.in_features, m.out_features
        # weight
        weight_memory_allocate += in_nodes * out_nodes * weight_type
        # bias
        weight_memory_allocate += out_nodes * weight_type
        # feature map (input to this layer)
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type
        input_size = [batch_size, out_nodes]
    elif isinstance(m, nn.MaxPool2d):
        out_len = int((input_size[2] + 2 * m.padding - m.kernel_size) / m.stride + 1)
        input_size = [batch_size, input_size[1], out_len, out_len]

model.apply(check_model_info)

print("---------------------------------------------------------")
print("original memory allocate")
print(f"total = {(weight_memory_allocate + feature_map_allocate)/1024.0/1024.0:.2f}MB")
print(f"weight = {weight_memory_allocate/1024.0/1024.0:.2f}MB")
print(f"feature_map = {feature_map_allocate/1024.0/1024.0:.2f}MB")
print("---------------------------------------------------------")
Output:
---------------------------------------------------------
original memory allocate
total = 4978.54MB
weight = 527.79MB
feature_map = 4450.75MB
---------------------------------------------------------
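As a quick cross-check (my addition, not part of the original measurement), the weight figure can also be computed directly from the model's parameters; a minimal sketch, assuming model from the code above:

param_bytes = sum(p.numel() for p in model.parameters()) * 4  # float32 = 4 bytes per element
print(f"parameters = {param_bytes/1024.0/1024.0:.2f}MB")      # ~527.79MB for vgg16, matching the weight figure above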
And this code checks GPU usage with torch.cuda.memory_allocated:
def to_MB(a):
    return a / 1024.0 / 1024.0

def test_memory_training(in_size=(3, 224, 224), out_size=1000, optimizer_type=torch.optim.SGD, batch_size=1, use_amp=False, device=0):
    sample_input = torch.randn(batch_size, *in_size, dtype=torch.float32)
    optimizer = optimizer_type(model.parameters(), lr=.001)
    model.to(device)
    print(f"After model to device: {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
    for i in range(5):
        optimizer.zero_grad()
        print("Iteration", i)
        with torch.cuda.amp.autocast(enabled=use_amp):
            a = torch.cuda.memory_allocated(device)
            out = model(sample_input.to(device)).sum()  # Taking the sum here just to get a scalar output
            b = torch.cuda.memory_allocated(device)
            print(f"After forward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
            print(f"Memory consumed by forward pass {to_MB(b - a):.2f}MB")
        out.backward()
        print(f"After backward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        optimizer.step()
        print(f"After optimizer step {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        print("---------------------------------------------------------")

test_memory_training(batch_size=batch_size)
Output:
After model to device: 529.04MB
Iteration 0
After forward pass 9481.04MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
---------------------------------------------------------
Iteration 1
After forward pass 10009.21MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
---------------------------------------------------------
......
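One thing worth adding (my suggestion, not part of the original experiment): torch.cuda.memory_allocated only counts tensors that are still alive at the moment of the call, so activations freed before the print after backward() no longer show up. The peak during an iteration can be read with torch.cuda.max_memory_allocated; a minimal sketch, assuming model (already on the GPU), batch_size and to_MB from the code above:

device = 0
sample_input = torch.randn(batch_size, 3, 224, 224)  # same input shape as in test_memory_training
torch.cuda.reset_peak_memory_stats(device)
out = model(sample_input.to(device)).sum()
out.backward()
print(f"peak allocated during forward+backward: {to_MB(torch.cuda.max_memory_allocated(device)):.2f}MB")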
This is the result output by nvidia-smi when training:
Here's a more detailed question:
I think PyTorch stores the following three things during the training step:
- model parameters
- input feature maps from the forward pass
- model gradient information for the optimizer
And I think the input feature maps have to be stored during the forward pass. In theory I expected about 4450.75MB to be stored, but 8952.00MB is actually allocated, almost a 2x difference.
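For reference, the parameter and gradient parts of that list can be checked directly; a minimal sketch (my addition, assuming the backward pass above has already populated the gradients), which roughly accounts for the ~1057MB seen after backward() (about 529MB of parameters plus about 528MB of gradients):

param_MB = sum(p.numel() for p in model.parameters()) * 4 / 1024.0 / 1024.0
grad_MB = sum(p.grad.numel() for p in model.parameters() if p.grad is not None) * 4 / 1024.0 / 1024.0
print(f"parameters = {param_MB:.2f}MB, gradients = {grad_MB:.2f}MB")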
And if you compare the memory usage reported by nvidia-smi with torch.cuda.memory_allocated, nvidia-smi shows about twice as much memory. What makes this difference?
Thanks for reading the long question. Any help is appreciated.
What is displayed in nvidia-smi is probably not the allocated memory, but the reserved memory. You can also read out the reserved memory using torch.cuda.memory_reserved.
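A minimal sketch of reading both numbers from within PyTorch (assuming a CUDA device is in use; the reserved figure is the memory held by PyTorch's caching allocator, which is closer to what nvidia-smi reports, and nvidia-smi additionally counts the CUDA context itself):

allocated = torch.cuda.memory_allocated(0)  # memory occupied by live tensors
reserved = torch.cuda.memory_reserved(0)    # memory held by the caching allocator
print(f"allocated = {allocated/1024.0/1024.0:.2f}MB, reserved = {reserved/1024.0/1024.0:.2f}MB")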