why cuda kernel function costs cpu?

865 views Asked by At

I found in some particular situation, cuda kernel function can cost cpu.

I'm working on win7(32bit) + vs2013(sp4) + cuda 6.5 + GTX 650

my code looks like:

cudaMalloc(args...); // malloc buffer outside the busy loop
while(1) {
    Sleep(1);
    kernel<<<B, T>>>(args...);
}

// and the kernel function(uses 30 registers) 
__global__ void kernel(args...) {
    char l_plain[256] = {0}; // use local variable
    // copy memory from argument buffer to l_plain
    // to avoid using global memory in the loop below

    // it costs CPU when the ROUND is very huge, like 1024
    for(int i=0; i<ROUND; i++) {
        uint u1, u2, u3, u4;
        md5_vfy(l_plain, 16, &u1, &u2, &u3, &u4); // a __device__ function that calculates md5 hash
        // prepare for next plain text
    }
}

I can verify that the <<<B, T>>> are well optimized, actually I use <<<32, 128>>>, the occupancy is almost 100% in Visual Profiler.

the function md5_vfy can be found here: http://pastebin.com/KU3zUxpb

On my machine, when the loop ROUND is small than 720, the cpu is always free, it costs 0% in task manager. And with changing the loop ROUND to 750/800/900/1000... the cpu cost also grows linearly.

I want to know what makes the difference, what costs the cpu when the ROUND is very large? I can provide the Visual Profiler screenshot.

Update Reason

I want to know this because I want to reduce all cpu cost. My program is supposed to support two modes: normal mode and game mode. Normal mode takes 100% cpu and gpu, maybe runs 800m hash per second. Game mode takes 0% cpu and 5-xx% gpu that runs 50m hash per second.

Thanks.

1

There are 1 answers

3
Robert Crovella On BEST ANSWER

Presumably as you increase ROUND, the kernel execution time takes longer.

Considering your main loop:

while(1) {
    Sleep(1);
    kernel<<<B, T>>>(args...);
}

It appears that the windows Sleep() function suspends thread execution for the specified period in milliseconds. Therefore, whether or not kernel in the above loop takes more or less than 1 millisecond to execute will be a very important factor in the dynamic behavior of your program.

Suppose that the execution time is less than 1 ms. In that case, each time the CPU thread returns from Sleep, the previous kernel invocation is finished, and a new kernel call begins to execute more-or-less immediately. Since kernel calls are asynchronous, control is returned to the CPU thread, which then goes back to sleep, and the overal CPU usage by this thread will be very low, on average.

Now let's consider the case where the kernel execution time exceeds 1 ms (perhaps because ROUND is larger). In that case, your main loop issues a kernel call, goes to sleep, wakes up 1 ms later, but the previously issued kernel is still executing. No problem, the CPU thread can still issue another kernel call, which will go into a waiting queue. Control is returned to the CPU thread, and the CPU thread then goes back to sleep. However, over time (and pretty rapidly, for a 1ms sleep period) this "excess" will add up, and eventually the queue will have two pending launches in it, then 3 pending launches in it, etc. Eventually the pending launch queue will be full.

At this point, the behavior changes dramatically. New kernel launches with a full launch-pending-queue are no longer asynchronous, and the CPU thread will block at that point, waiting for a queue slot to open up. At this point, the CPU utilization will go to 100% while the thread is busy-waiting for a queue slot to open up. Once a queue slot opens up, it will issue the next kernel call (which will take it's place in the newly-opened queue slot), and go to sleep.

At this point as you further increase the length of kernel execution, the time that the CPU spends busy-waiting (100% utilization) vs. in sleep (0% utilization) will ratiometrically change as you increase the length of kernel execution.

I'm not entirely sure what your goal is, but to avoid this, you might try increasing the Sleep() period to roughly match (i.e. be somewhat larger than) your kernel execution time, as you increase the ROUND value.