I am now fine-tuning VGG-Face (a very large model) with 8 TITAN Xp GPUs available. However, Caffe gives an out-of-memory error when I increase the batch_size. Here is what I did:
First, batch_size was set to 40 for the training stage, and it worked fine on a single GPU. The chosen GPU was nearly 100% utilized.
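For reference, the batch_size I mean is the one in the Data layer of my train prototxt. A simplified sketch of that layer (the layer name, LMDB path, and other fields here are placeholders, not my exact config) looks like this:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "path/to/train_lmdb"
    batch_size: 40   # value used for the single-GPU run
    backend: LMDB
  }
}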
Then, I increased batch_size to 128 and trained with all 8 GPUs using
'./build/tools/caffe train -solver mysolver.prototxt -gpu all'
All the GPUs were fully utilized, as shown in nvidia-smi.jpg.
And Caffe gave me the following error:
F0906 03:41:32.776806 95655 parallel.cpp:90] Check failed: error ==cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7f9a0832995d google::LogMessage::Fail()
@ 0x7f9a0832b6e0 google::LogMessage::SendToLog()
@ 0x7f9a08329543 google::LogMessage::Flush()
@ 0x7f9a0832c0ae google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9a08abe825 caffe::GPUParams<>::GPUParams()
@ 0x7f9a08abefd8 caffe::NCCL<>::NCCL()
@ 0x40dc69 train()
@ 0x40a8ed main
@ 0x7f9a06abf830 (unknown)
@ 0x40b349 _start
Aborted (core dumped)
Theoretically, I should be able to train with batch_size = 40 * 8 = 320. (Please let me know if I am right here.)
So, how can I fully utilize the GPUs to accelerate my model training? Thanks in advance!
When using multiple GPUs, you don't need to increase the batch size in your prototxt. The batch_size in the data layer is per GPU: if it is set to 40, Caffe loads a batch of 40 on each of the 8 GPUs, giving you an effective batch size of 40 * 8 = 320 without changing anything. Conversely, setting batch_size to 128 makes every GPU try to hold a batch of 128 plus its own copy of the model, which is why you run out of memory.
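Concretely: keep batch_size: 40 in your train prototxt exactly as it was for the single-GPU run, launch the same way as before, and (in another terminal) watch per-GPU memory to confirm each card holds one batch of 40 plus a model replica:

./build/tools/caffe train -solver mysolver.prototxt -gpu all
watch -n 1 nvidia-smi   # each GPU should show a similar memory footprint

Each GPU then processes its own batch of 40 per iteration and the gradients are synchronized across the 8 replicas (via NCCL in your build, as the stack trace shows), so the effective batch size per iteration is 40 * 8 = 320.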