I'm trying to analyze the "cost" of a certain part of the computation performed by a CUDA kernel of mine. There is, of course, the use of profiling; but what I'm trying to determine is how much I could gain by investing in speeding up, or replacing, that certain part.
I want to make that estimation by determining how the kernel behaves when I cut out that part - not replacing it with something more efficient, just skipping it. So, my kernel might look like (semi-pseudocode, although technically valid CUDA C++):
```cuda
// -- snip --
auto data = read_data(input);
auto massaged_data = massage(data);
auto result = heavy_lifting(massaged_data); // (*)
update(output, result);
```
and I replace the `(*)` line with `auto result = dummy_result();`, which default-initializes the result, sets it to all zeros, etc.
Unfortunately, this approach does not work, because either the nvcc frontend or ptxas optimizes away most of the code related to the data that's no longer used: most of `read_data()` and most of `massage()` go away, and perhaps even all of them.
I also tried marking some variables and fields `volatile` - either the result, or intermediate values during its computation, or both; but this didn't help.
My question: Can I force `massaged_data` to be computed, despite it never being used later, and without otherwise disabling any compiler optimizations?
Note: If you believe this can be done with `volatile`, but that perhaps I did not properly volatile-up my program, that would be a valid answer; but please give a concrete example of how to actually do it right, and mention potential "gotchas" if you know of any.
Yes. Paulius Micikevicius proposed a design pattern in the early days of CUDA performance analysis that looks something like this:
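A minimal sketch of the pattern (the kernel signature and the `dobigcalculation` stub here are illustrative; the essential parts are the runtime `benchmark` flag and the guarded global-memory write):

```cuda
// Stand-in for the expensive computation being measured
__device__ float dobigcalculation(float x)
{
    for (int k = 0; k < 100; ++k) { x = x * x + 0.25f; }
    return x;
}

__global__ void kernel(const float* input, float* output, int n, int benchmark)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float result = dobigcalculation(input[i]);

    // The host always launches with benchmark == 0, so this store never
    // executes at runtime; but because benchmark is a runtime value, the
    // compiler must assume the branch may be taken and therefore keeps
    // the computation of `result` (and everything feeding it) alive.
    if (benchmark) {
        output[i] = result;
    }
}
```

You then time the kernel once as-is and once with the expensive call replaced by a cheap dummy; in both variants the `benchmark` guard keeps the compiler from deleting the surrounding code.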
Because the result of `dobigcalculation` is connected to a global memory write, and because `benchmark` doesn't resolve to an immediate constant at compile time, the optimizer won't identify anything as dead code and strip it out.

Is it perfect? No.
You can potentially change the register profile of the kernel, and you are adding at least one extra argument, which might change how the compiler builds your code. The code reordering heuristics can also make it hard to replicate the true flow of the unmodified code in some cases, meaning you really need to check the PTX the compiler emits. And while the design pattern is trivial for cases where `dobigcalculation` returns a simple POD type, it can be harder to devise an assignment operation that truly ensures parts of your code are not silently stripped out when the function operates on a class or a reference rather than a return value.

But it is about the best you can do to measure the impact of a given code stanza in fully optimized code.
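For the non-POD case, one workaround (my own sketch, not part of the original pattern) is to fold the object's members into a single scalar inside the guarded region, so the compiler must materialize the whole object while you add only one predicated store:

```cuda
struct Big { float vals[16]; };   // illustrative aggregate result type

// Assumed to exist elsewhere: the expensive computation being measured
__device__ Big dobigcalculation(const float* input, int i);

__global__ void kernel(const float* input, float* output, int n, int benchmark)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Big result = dobigcalculation(input, i);

    if (benchmark) {
        // Reduce every member into one value so no part of `result` can
        // be proven dead; the reduction itself sits on the never-taken
        // branch, so it adds nothing to the measured runtime.
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k) acc += result.vals[k];
        output[i] = acc;
    }
}
```

As with the POD case, inspect the emitted PTX to confirm the full computation actually survived.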