Can I force certain computations to occur despite their result not being used in the kernel?

I'm trying to analyze the "cost" of a certain part of the computation performed by a CUDA kernel of mine. Profiling is, of course, an option; but what I'm trying to determine is how much I could gain by investing in speeding up, or replacing, that particular part.

I want to make that estimation by determining how the kernel behaves when I cut out that part - not replacing it with something more efficient, just skipping it. So, my kernel might look like (semi-pseudocode, although technically valid CUDA C++):

// -- snip --
auto data = read_data(input);
auto massaged_data = massage(data);
auto result = heavy_lifting(massaged_data); // (*)
update(output, result);

and I replace the (*) line with

auto result = dummy_result();

which default-initializes the result, sets it to zeros, etc.
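For instance, a minimal sketch (the type name result_t is hypothetical, since the question doesn't show the real result type):

__device__ result_t dummy_result()
{
    return result_t{}; // value-initialized, i.e. everything zeroed
}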

Unfortunately, this approach does not work, because either the nvcc frontend or ptxas optimizes away most of the code related to the unused data: most of read_data() and most of massage() go away, and perhaps even all of them.

I also tried marking some variables and fields volatile - either the result, or intermediate values during its computation, or both; but this didn't help.

My question: Can I force massaged_data to be computed, despite it never being used later, and without otherwise disabling any compiler optimizations?

Note: If you believe this can be done with volatile, but that perhaps I did not properly volatile-up my program, this would be a valid answer, but try to give a concrete example of how to actually do it right, and mention potential "gotchas" if you know of any.

There is 1 answer

talonmies

Can I force massaged_data to be computed, despite it never being used later, and without otherwise disabling any compiler optimizations?

Yes. Paulius Micikevicius proposed a design pattern in the early days of CUDA performance analysis that looks something like this:

__global__ void mykernel(....., T *dummy, int benchmark=0)
{

    .....

    // Some code

    .....

    // Code you want to conditionally disable for benchmarking
    T result;
    if (benchmark > 0) {
         result = dobigcalculation();
    }

    // More code

    if (benchmark > 0) *dummy = result;

}

Because the result of dobigcalculation is connected to a global memory write, and because benchmark doesn't resolve to an immediate constant at compile time, the optimizer won't identify anything as dead code and strip it out.
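A hedged host-side usage sketch of the pattern (grid, block and the elided kernel arguments are placeholders, and the cudaEvent timing scaffolding is just one way to compare the two variants):

T *dummy;
cudaMalloc(&dummy, sizeof(T));

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Full run: benchmark > 0 keeps dobigcalculation() and the dummy write
cudaEventRecord(start);
mykernel<<<grid, block>>>(....., dummy, 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_full = 0.0f;
cudaEventElapsedTime(&ms_full, start, stop);

// Stripped run: benchmark == 0 skips the calculation at run time, but the
// compiler can't prove that at compile time, so the rest of the kernel is
// generated exactly as in the full run
cudaEventRecord(start);
mykernel<<<grid, block>>>(....., dummy, 0);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_stripped = 0.0f;
cudaEventElapsedTime(&ms_stripped, start, stop);

// ms_full - ms_stripped approximates the cost of the heavy calculation
// in otherwise fully optimized code
cudaFree(dummy);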

Is it perfect? No.

You can potentially change the register profile of the kernel, and you are adding at least one extra argument, which might change how the compiler builds your code. The code reordering heuristics can also make it hard for the modified code to replicate the true flow of the unmodified code in some cases, meaning you really need to check the PTX which the compiler emits. And while the design pattern is trivial for cases where dobigcalculation returns a simple POD type, it can be harder to devise an assignment operation that truly ensures parts of your code are not silently stripped out when the function operates on a class or a reference, rather than a return value.
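For the non-POD case, one hedged workaround (the BigResult type and its members below are hypothetical) is to sink the object's representation into the dummy buffer, so that every member stays connected to a global memory write without needing a suitable assignment operator:

struct BigResult { float a; double b; int c; }; // hypothetical, trivially copyable

// Inside the kernel, with dummy declared as unsigned char*:
if (benchmark > 0) {
    BigResult result = dobigcalculation();
    // Copying the whole object representation keeps every member live; for
    // non-trivially-copyable classes, store the individual members instead
    memcpy(dummy, &result, sizeof(result));
}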

But it is about the best you can do to measure the impact of a given code stanza in fully optimized code.