CUDA kernels are not overlapping


I have a simple vector multiplication kernel, which I am executing on 2 streams. But when I profile it in NVVP, the kernels do not seem to overlap. Is it because each kernel execution utilizes 100% of the GPU? If not, what could be the cause?

(NVVP timeline screenshot: the kernels from the two streams execute back-to-back rather than overlapping.)

Source code:

#include "common.h"
#include <cstdlib>
#include <stdio.h>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"
#include <string.h>

const int N = 1 << 20;

__global__ void kernel(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * y[i];
}

int main()
{

    float *x, *y, *d_x, *d_y, *d_1, *d_2;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));

    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    cudaMalloc(&d_1, N*sizeof(float));
    cudaMalloc(&d_2, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_1, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_2, y, N*sizeof(float), cudaMemcpyHostToDevice);

    const int num_streams = 8;

    cudaStream_t stream1;
    cudaStream_t stream2;

    cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);

    cudaEvent_t start, stop;
    float elapsedTime;

    cudaEventCreate(&start);
    cudaEventRecord(start, 0);

    for (int i = 0; i < 300; i++) {
        kernel<<<512, 512, 0, stream1>>>(N, d_x, d_y);
        kernel<<<512, 512, 0, stream2>>>(N, d_1, d_2);
    }

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
    // cudaDeviceSynchronize();

    cudaEventCreate(&stop);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Elapsed time : %f ms\n", elapsedTime);

    cudaDeviceReset();
    cudaProfilerStop();
    return 0;
}

EDIT: From the comments I understand that each kernel utilizes the GPU fully, so what is the best approach for performing this 262144-element vector multiplication across multiple streams?

My device information :

CUDA Device Query...
There are 1 CUDA devices.

CUDA Device #0
Major revision number:         5
Minor revision number:         0
Name:                          GeForce GTX 850M
Total global memory:           0
Total shared memory per block: 49152
Total registers per block:     65536
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   2147483647
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    901500
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     5
Kernel execution timeout:      Yes

1 Answer

Answer by Greg (accepted):

The reason your kernels don't overlap is that your GPU is 'filled' with execution threads, as @Robert Crovella mentions in the comments. Checking the Compute Capabilities chapter of the CUDA Programming Guide, there is a limit of 2048 threads per SM for your compute capability (5.0). You have 5 SMs, so at most 5 x 2048 = 10240 threads can run simultaneously on your device. A single kernel call launches 512 x 512 = 262144 threads, which leaves practically no room at all for the other kernel call.
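
A quick way to verify this on any device is to query the device properties at runtime (a minimal sketch using the standard CUDA runtime API; the numeric comments assume the GTX 850M figures above):

#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of threads that can be resident on the whole device.
    int maxResident = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;

    // GTX 850M (CC 5.0): 2048 threads/SM * 5 SMs = 10240 resident threads.
    // A single <<<512, 512>>> launch requests 512 * 512 = 262144 threads,
    // about 26x the device's capacity, so the second launch has to wait.
    printf("Resident thread capacity:    %d\n", maxResident);
    printf("Threads requested per launch: %d\n", 512 * 512);
    return 0;
}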

You need to launch small enough kernels so that 2 can run concurrently on your device.
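
For the device above, that means keeping each launch at or below roughly half the resident-thread capacity. A sketch, reusing the names from the question (this is a fragment meant to drop into the question's main(), not a standalone program; the block count assumes the GTX 850M figures):

// 10240 resident threads / 2 launches = 5120 threads per launch,
// i.e. 10 blocks of 512 threads each on this device.
const int blockSize = 512;
const int blocksPerLaunch = 10;

// With so few threads covering N = 262144 elements, each thread must
// loop over several elements -- see the grid-stride kernel sketched
// at the end of this answer.
kernel<<<blocksPerLaunch, blockSize, 0, stream1>>>(N, d_x, d_y);
kernel<<<blocksPerLaunch, blockSize, 0, stream2>>>(N, d_1, d_2);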

I'm not an expert on streams, but from what I've understood, if you want to run your program using streams you need to split the work into chunks and compute a proper offset for each chunk so that every stream accesses its own data. In your current code, each stream you launch performs exactly the same calculation over exactly the same data. You have to split the data among the streams.

Other than that, if you want to get maximum performance you need to overlap kernel execution with asynchronous data transfers. The easiest way to do this is to assign a scheme like the following to each of your streams, as presented here:

for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}

This scheme simply tells each stream to copy its chunk of data to the device, execute the kernel on it, and then copy the result back. Once the asynchronous calls have been issued, the streams work simultaneously, each completing its own tasks.
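
One caveat worth adding: cudaMemcpyAsync only overlaps with kernel execution when the host buffer is pinned (page-locked); with ordinary pageable memory the transfer is staged through a driver buffer and is effectively synchronous. A minimal sketch, assuming the N from the question:

// Pinned host allocation so cudaMemcpyAsync can truly run asynchronously.
float *a;
cudaMallocHost(&a, N * sizeof(float));  // instead of malloc()

// ... fill a[], run the streamed loop above ...

cudaFreeHost(a);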

PS: I would also recommend revising the kernel itself. Using one thread to compute just one multiplication is overkill; I would have each thread process more data.
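
For example, a grid-stride loop version of the question's kernel (a common CUDA pattern; a sketch, not tuned for this specific device) lets an arbitrarily small launch cover the whole array:

__global__ void kernel(int n, float *x, float *y)
{
    // Each thread starts at its global index and strides by the total
    // number of threads in the grid, so any grid size covers all n elements.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = x[i] * y[i];
}

With this kernel, the small <<<10, 512>>> launches sketched earlier still compute the full 262144-element product while leaving room for the other stream's kernel to run alongside.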