While debugging my program on large numbers of cores, I ran into a very strange error about insufficient virtual memory. My investigation led me to the piece of code where the master sends small messages to each slave. I then wrote a small test program in which one master simply sends 10 integers with MPI_Send and all the slaves receive them with MPI_Recv. Comparing /proc/self/status before and after the MPI_Send shows that the difference in memory usage is huge! The most interesting thing (and what crashes my program) is that this memory is not deallocated after MPI_Send and still takes up a huge amount of space.
Any ideas?
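The test is essentially the following (a simplified sketch: the buffer contents don't matter, error handling is omitted, and the real program prints only selected fields of /proc/self/status):

    #include <mpi.h>
    #include <stdio.h>

    /* Dump /proc/self/status; the output below shows only selected fields. */
    static void print_status(const char *when, int rank)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return;
        printf("System memory usage %s MPI_Send, rank %d\n", when, rank);
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
    }

    int main(int argc, char **argv)
    {
        int rank, size, buf[10] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            print_status("before", rank);
            /* Master sends 10 integers to every slave. */
            for (int dest = 1; dest < size; dest++)
                MPI_Send(buf, 10, MPI_INT, dest, 0, MPI_COMM_WORLD);
            print_status("after", rank);
        } else {
            MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }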
System memory usage before MPI_Send, rank: 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 251400 kB
VmSize: 186628 kB
VmLck: 72 kB
VmHWM: 4068 kB
VmRSS: 4068 kB
VmData: 71076 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 148 kB
VmSwap: 0 kB
Threads: 3
System memory usage after MPI_Send, rank 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 456880 kB
VmSize: 456872 kB
VmLck: 257884 kB
VmHWM: 274612 kB
VmRSS: 274612 kB
VmData: 341320 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 676 kB
VmSwap: 0 kB
Threads: 3
This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked (pinned) into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as a source or destination again. If the registered memory was heap memory, it is never returned to the operating system, which is why your data segment only grows in size.
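For illustration, this is roughly what such a registration looks like at the libibverbs level (a stripped-down sketch with no error handling, not what Intel MPI does internally):

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Open the first InfiniBand device and allocate a protection domain. */
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Register (pin) a 1 MiB heap buffer: its pages are locked in physical
         * memory and the HCA learns the virtual-to-physical mapping. This is
         * the slow step that MPI libraries try to perform only once per buffer. */
        size_t len = 1 << 20;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        printf("registered %zu bytes, lkey=0x%x\n", len, mr->lkey);

        /* Until ibv_dereg_mr() is called, the buffer stays pinned and counts
         * towards VmLck, which is the kind of growth seen in /proc/self/status. */
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }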
Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs high memory overhead. Most people don't really think about this and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) which are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.
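To make the registration cache work in your favour, allocate your communication buffers once and keep reusing them instead of allocating a fresh buffer for every message. A rough sketch of the two patterns (send_wasteful, send_frugal, NITER, MSG_LEN and dest are just illustrative names and placeholders):

    #include <mpi.h>
    #include <stdlib.h>

    enum { NITER = 1000, MSG_LEN = 10 };

    /* Anti-pattern: a new heap buffer per message may mean a new registration
     * per message, and the registration cache keeps them pinned after free(). */
    void send_wasteful(int dest)
    {
        for (int i = 0; i < NITER; i++) {
            int *tmp = calloc(MSG_LEN, sizeof(int));
            MPI_Send(tmp, MSG_LEN, MPI_INT, dest, 0, MPI_COMM_WORLD);
            free(tmp);
        }
    }

    /* Better: one buffer, registered once, reused for every message. */
    void send_frugal(int dest)
    {
        int *buf = calloc(MSG_LEN, sizeof(int));
        for (int i = 0; i < NITER; i++)
            MPI_Send(buf, MSG_LEN, MPI_INT, dest, 0, MPI_COMM_WORLD);
        free(buf);
    }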
There are certain parameters that control the IB queues used by Intel MPI. The most important one in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of possible performance implications though. You can also try dynamically enlarged buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI initially registers small buffers and grows them later if needed. Note also that IMPI opens connections lazily, which is why you only see the huge increase in used memory after the call to MPI_Send.
If you are not using the DAPL transport, e.g. if you use the ofa transport instead, there is not much that you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1, which should decrease the memory used somewhat. Enabling dynamic queue pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might also decrease memory usage if the communication graph of your program is not fully connected (a fully connected program is one in which each rank talks to all other ranks).
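For example, assuming you launch with mpirun (the process count, the value 8 and the executable name are just placeholders, and I_MPI_FABRICS is only there to select the transport explicitly), with the DAPL transport you could try:

    export I_MPI_FABRICS=shm:dapl
    export I_MPI_DAPL_BUFFER_NUM=8
    export I_MPI_DAPL_BUFFER_ENLARGEMENT=1
    mpirun -n 256 ./test_send_size

and with the OFA transport:

    export I_MPI_FABRICS=shm:ofa
    export I_MPI_OFA_USE_XRC=1
    export I_MPI_OFA_DYNAMIC_QPS=1
    mpirun -n 256 ./test_send_size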