I know of the existence of nvvp and nvprof, of course, but for various reasons nvprof refuses to work with my app, which involves lots of shared libraries. nvidia-smi can hook into the driver to find out what's running, but I cannot find a nice way to get nvprof to attach to a running process.
There is a flag --profile-all-processes which does print a message "NVPROF is profiling process 12345", but nothing further is ever printed. I am using CUDA 8.
How can I get a detailed performance breakdown of my CUDA kernels in this situation?
As the comments suggest, you simply have to make sure the CUDA profiler is started before the processes you want to profile (in current CUDA versions that means Nsight Systems or Nsight Compute, no longer nvprof). You could, for example, configure it to run at system startup.
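For concreteness, here is a rough sketch of what that looks like, first with the CUDA 8 era nvprof from the question and then with the newer tools. The application name and output paths are illustrative, not from your setup:

```shell
# --- CUDA 8 / nvprof, system-wide mode ---
# Start nvprof *before* launching the app. It stays resident and
# announces each CUDA process it picks up ("NVPROF is profiling
# process <pid>"). %p in the output name is replaced by the PID,
# so each profiled process gets its own file.
nvprof --profile-all-processes -o /tmp/profile_%p.nvprof

# In another terminal, launch the application normally
# (./my_app is a placeholder for your binary):
./my_app

# After the app exits, import a per-process file for a CLI summary,
# or open it in nvvp:
nvprof -i /tmp/profile_12345.nvprof

# --- Newer CUDA toolkits ---
# Nsight Systems for a timeline, Nsight Compute for per-kernel detail:
nsys profile -o report ./my_app
ncu -o kernels ./my_app
```

Note that the detailed per-kernel metrics asked about come from Nsight Compute (or nvprof's metric collection), while Nsight Systems gives the whole-application timeline.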
Your inability to profile your application has nothing to do with it being an "app that involves lots of shared libraries" - the profiling tools handle such applications just fine.