How to reduce CUDA context size (Multi-Process Service)


I followed Robert Crovella's example on how to use Nvidia's Multi-Process Service. According to docs:

2.1.2. Reduced on-GPU context storage

Without MPS, each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients.

I understood this to mean that each process's context size shrinks because the storage is shared. This would free up GPU memory and thus enable running more processes in parallel.
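For context, a minimal sketch of how the MPS server in the example is started and stopped (these are the standard `nvidia-cuda-mps-control` commands; the GPU index and any pipe-directory settings are assumptions that may differ on your system):

```shell
# Select the GPU(s) the MPS server will manage (index 0 assumed here)
export CUDA_VISIBLE_DEVICES=0

# Start the MPS control daemon in the background;
# CUDA processes launched afterwards connect to it as MPS clients
nvidia-cuda-mps-control -d

# ... run your CUDA processes here ...

# Shut the control daemon (and the MPS server) down when finished
echo quit | nvidia-cuda-mps-control
```

Note that the daemon typically needs to be started by the user owning the GPU (or root), and clients started before the daemon will not be MPS clients.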

Now, back to the example. Without MPS:

[screenshot: nvidia-smi output, MPS disabled]

And with MPS:

[screenshot: nvidia-smi output, MPS enabled]

Unfortunately, each process still takes virtually the same amount of memory (~300 MB). Doesn't this contradict the docs? Is there a way to decrease per-process memory consumption?


There are 2 answers

alex (best answer)

Oops, I asked too eagerly before checking the memory usage on the other (pre-Volta) card, and yes, there actually is a difference. Let me post it here for future reference in case anyone else stumbles on this problem:

MPS off:

[screenshot: nvidia-smi output, MPS disabled]

MPS on:

[screenshot: nvidia-smi output, MPS enabled]

Raz Rotenberg

Indeed, on the Volta architecture the client processes communicate directly with the GPU, without the MPS server in the middle. As the docs put it:

Volta MPS clients submit work directly to the GPU without passing through the MPS server.

This is visible in your first screenshot, where the t1034 processes themselves are listed as using the GPU.

By contrast, on pre-Volta architectures the client processes communicate with the GPU through the MPS server. That is why only the MPS server process appears as communicating with the GPU in the latter screenshot.
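You can check this distinction without reading the full nvidia-smi table by querying only the compute processes (a sketch using real `nvidia-smi` query flags; it requires an NVIDIA driver to run, and the exact process names shown are, of course, system-dependent):

```shell
# List the processes nvidia-smi attributes GPU compute usage to.
# On a pre-Volta GPU with MPS enabled, expect to see only the
# nvidia-cuda-mps-server process here; on Volta and later, each
# client PID is listed individually.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```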