How can I access the numeric stream IDs seen in nvprof, using a cudaStream_t?


In nvprof I can see the stream IDs for each CUDA execution stream I am using (0, 13, 15, etc.).

Given a stream variable, I'd like to print out its stream ID. I cannot find any API to do this, and casting the cudaStream_t to an int or unsigned int does not yield a reasonable ID. sizeof() reports that cudaStream_t is 8 bytes.

Robert Crovella (best answer):

Briefly: I don't know of a method to access these IDs directly, but you can give streams explicit names for profiling purposes.


cudaStream_t is an opaque "resource handle" type. A resource handle is something like a pointer, so it stands to reason that the stream ID is stored not in the pointer (handle) itself but somewhere in what it refers to.

Since the handle is opaque (CUDA provides no definition of what it points to) and, as you point out, there is no direct API for this, I don't think you'll find a way to extract the stream ID from a cudaStream_t at runtime.

To confirm that cudaStream_t is an opaque resource handle, refer to the CUDA header file driver_types.h.
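To illustrate why casting the handle does not produce the profiler's ID, here is a minimal host-only C++ sketch. It redeclares cudaStream_t the way driver_types.h is assumed to (a pointer to an incomplete struct named CUstream_st) so that it compiles without the CUDA toolkit. handle_bits is a hypothetical helper for illustration, not a CUDA API.

```cpp
#include <cstdint>

// Assumed shape of the declaration in driver_types.h: a pointer to a struct
// whose definition is never published. In real CUDA code, include
// <cuda_runtime.h> instead of redeclaring these two lines.
struct CUstream_st;
typedef CUstream_st *cudaStream_t;

// The handle is pointer-sized, which is why sizeof() reports 8 bytes on a
// 64-bit platform.
static_assert(sizeof(cudaStream_t) == sizeof(void *),
              "cudaStream_t is expected to be pointer-sized");

// Hypothetical helper: returns the raw bits of the handle. This is an
// address-like value, not the small integer ID that nvprof prints.
std::uintptr_t handle_bits(cudaStream_t s) {
  return reinterpret_cast<std::uintptr_t>(s);
}
```

Because the struct is incomplete, the handle cannot be dereferenced from user code, so whatever ID the driver keeps for the stream is out of reach.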

However, the NV Tools Extension (NVTX) API lets you "name" a particular stream (or other resource). This allows you to associate a stream in your source code with a name shown in the profiler.

Here's a trivial worked example:

$ cat t138.cu
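#include <stdio.h>
#include <nvToolsExtCudaRt.h>

// Number of clock cycles each kernel spins for, so it is easy to see in the trace.
const long long tdel = 1000000000LL;

// Busy-wait kernel: loops until tdel clock cycles have elapsed.
__global__ void tkernel(){

  long long st = clock64();  // clock64() returns long long
  while (clock64() < st+tdel);
}

int main(){

  cudaStream_t s1, s2, s3, s4;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStreamCreate(&s3);
  cudaStreamCreate(&s4);
#ifdef USE_S_NAMES
  // Give each stream a name that the profiler displays instead of its numeric ID.
  nvtxNameCudaStreamA(s1, "stream 1");
  nvtxNameCudaStreamA(s2, "stream 2");
  nvtxNameCudaStreamA(s3, "stream 3");
  nvtxNameCudaStreamA(s4, "stream 4");
#endif
  // Launch one kernel in each stream.
  tkernel<<<1,1,0,s1>>>();
  tkernel<<<1,1,0,s2>>>();
  tkernel<<<1,1,0,s3>>>();
  tkernel<<<1,1,0,s4>>>();

  // Wait for all four kernels to finish before exiting.
  cudaDeviceSynchronize();
}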

$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt
$ nvprof --print-gpu-trace ./t138
==28720== NVPROF is profiling process 28720, command: ./t138
==28720== Profiling application: ./t138
==28720== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
464.80ms  622.06ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1        13  tkernel(void) [393]
464.81ms  621.69ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1        14  tkernel(void) [395]
464.82ms  623.30ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1        15  tkernel(void) [397]
464.82ms  622.69ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1        16  tkernel(void) [399]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt -DUSE_S_NAMES
$ nvprof --print-gpu-trace ./t138
==28799== NVPROF is profiling process 28799, command: ./t138
==28799== Profiling application: ./t138
==28799== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
457.98ms  544.07ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1  stream 1  tkernel(void) [393]
457.99ms  544.31ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1  stream 2  tkernel(void) [395]
458.00ms  544.07ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1  stream 3  tkernel(void) [397]
458.00ms  544.07ms              (1 1 1)         (1 1 1)         8        0B        0B         -           -  TITAN X (Pascal         1  stream 4  tkernel(void) [399]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
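
If the same names are also wanted inside the program (for logging, say), one option is to keep your own table keyed by the handle. The following is a minimal sketch under stated assumptions, not part of CUDA or NVTX: StreamLabels is a hypothetical helper, and cudaStream_t is redeclared as a pointer to an incomplete struct only so the sketch compiles without the CUDA toolkit.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for the opaque handle type (assumed shape of the declaration in
// driver_types.h). In real CUDA code, include <cuda_runtime.h> instead.
struct CUstream_st;
typedef CUstream_st *cudaStream_t;

// Hypothetical helper: remembers a caller-chosen label for each stream handle.
class StreamLabels {
public:
  // Store a label for a stream. The same string could also be passed to
  // nvtxNameCudaStreamA so that the profiler shows a matching name.
  void set(cudaStream_t s, const std::string &label) { labels_[s] = label; }

  // Return the stored label, or "unnamed" if the handle was never registered.
  std::string get(cudaStream_t s) const {
    auto it = labels_.find(s);
    return it == labels_.end() ? std::string("unnamed") : it->second;
  }

private:
  // Keyed by the handle's pointer value; the handle is never dereferenced.
  std::unordered_map<const void *, std::string> labels_;
};
```

In the example above, calling labels.set(s1, "stream 1") next to each nvtxNameCudaStreamA call would let labels.get(s1) return the same text that nvprof prints in its Stream column.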