CUDA Dynamic Parallelizm; stream synchronization from device

Question

CUDA Dynamic Parallelizm; stream synchronization from device

1.4k views Asked by user3100782 At 29 December 2024 at 05:50

I am basically looking for a way to synchronize a stream from within the device. I want to avoid using cudaDeviceSynchronize(), as it would serialize execution of my kernel that I want to execute concurrently using streams;

More detailed description: I have written a kernel, that is a stabilized bi-conjugate gradient solver. I want to lunch this kernel concurrently on different data using streams.

This kernel uses cublas functions. They are called from within the kernel.

One of operations required by the solver is calculation of a dot product of two vectors. This can be done with cublasdot(). But as this call is synchronous, execution of kernels in different streams get serialized. Instead of calling a dot product function, I calculate the dot product using cublasspmv(), which is called asynchronously. The problem is that this function returns before the result is calculated. I want therefore to synchronize the stream from the device - I am looking for an equivalent of cudaStreamSynchronize() but callable from the device.

__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, real_t * x, real_t * y) {
      float *norm; norm = new float; 
      float alpha = 1.0f; float beta = 0.0f;

      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N ,1 , n, &alpha, x, 1, y, 1, &beta, norm, 1);

      return *norm;
}

What can I do to make sure, that the result is calculated before the function returns? Of course insertion of cudaDeviceSynchronize() works, but as I mentioned, it serializes the execution of my kernel across streams.

Original Q&A

There are 1 answers

**Robert Crovella** · Answer 1 · 2013-12-19T06:32:21+00:00

Probably if you read the programming guide dynamic parallelism section carefully (especially streams, events, and synchronization), you may get some ideas. Here's what I came up with:

There is an implicit NULL stream (on the device) associated with the execution sequence that calls your _cDdot function (oddly named, IMHO, since you're working with float quantities in that case, i.e. using Sgemv). Therefore, any cuda kernel or API call issued after the call to cublasSgemv_v2 in your function should wait until any cuda activity associated with the cublasSgemv_v2 function is complete. If you insert an innocuous cuda API call, or else a dummy kernel call, after the call to cublasSgemv_v2, it should wait for that to be complete. This should give you the thread-level synchronization you are after. You might also be able to use a cudaEventRecord call followed by a cudaStreamWaitEvent call.

Here's an example to show the implicit stream synchronization approach:

#include <stdio.h>
#include <cublas_v2.h>
#define SZ 16

__global__ void dummy_kernel(float *in, float *out){
  *out = *in;
}

__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, float * x, float * y, const int wait) {
      float *norm; norm = new float;
      float alpha = 1.0f; float beta = 0.0f;
      *norm = 0.0f;
      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N ,1 , n, &alpha, x, 1, y, 1, &beta, norm, 1);
      if (wait){
        dummy_kernel<<<1,1>>>(norm, norm);
        }
      return *norm;
}


__global__ void compute(){
  cublasHandle_t my_h;
  cublasStatus_t status;
  status = cublasCreate(&my_h);
  if (status != CUBLAS_STATUS_SUCCESS) printf("cublasCreate fail\n");
  float *x, *y;
  x = new float[SZ];
  y = new float[SZ];
  for (int i = 0; i < SZ; i++){
    x[i] = 1.0f;
    y[i] = 1.0f;}
  float result = _cDdot(my_h, SZ, x, y, 0);
  printf("result with no wait = %f\n", result);
  result = _cDdot(my_h, SZ, x, y, 1);
  printf("result with wait = %f\n", result);
}

int main(){

  compute<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

compile with:

nvcc -arch=sm_35 -rdc=true -o t302 t302.cu -lcudadevrt -lcublas -lcublas_device

results:

$ ./t302
result with no wait = 0.000000
result with wait = 16.000000
$

Unfortunately I tried a completely empty dummy_kernel; that did not work, unless I compiled with -G. So the compiler may be smart enough to optimize out a complete empty child kernel call.

TechQA.

CUDA Dynamic Parallelizm; stream synchronization from device

There are 1 answers

Related Questions in CUDA

Related Questions in CUBLAS

Popular Questions

Popular Tags

Trending Questions