I have a CUDA kernel with a bunch of loops I want to unroll. Right now I do:
__global__ void mykernel(int* in, int* out, int baz) {
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        foo();
    }
    /* ... */
    #pragma unroll
    for (int i = 0; i < 6; i++) {
        bar();
    }
}
and so on. I want to hint to my C/C++ compiler that it should unroll all of these loops, without needing a separate hint for each loop. However, I don't want to unroll every loop in the file, just the loops in this function.
If this were GCC, I could do:
__attribute__((optimize("unroll-loops")))
void mykernel(int* in, int* out, int baz) {
    for (int i = 0; i < 4; i++) {
        foo();
    }
    /* ... */
    for (int i = 0; i < 6; i++) {
        bar();
    }
}
Or I could push and pop options around the function (#pragma GCC push_options / #pragma GCC pop_options). Is there something equivalent I can do with CUDA?
#pragma unroll
is the only mechanism for requesting unrolling that is documented in the CUDA C Programming Guide 5.5, and it must be specified immediately before each loop. However, the compiler unrolls all "small loops with a known trip count" by default, so you may not need the unroll directives in your first example.
I don't think controlling unrolling at the function level would be all that useful. You should probably rely on the compiler at first to select the best amount of unrolling, and then tweak each loop separately if profiling indicates that it could help.
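If you still want to avoid repeating the directive, one workaround that does not rely on any compiler hint is to let templates expand the iterations at compile time. This is only a minimal sketch of that idea, not a CUDA feature: Unroll, Accumulate, unrolled_sum4 and the HD macro are hypothetical names invented for illustration, and the functor stands in for a loop body such as foo() in the question. Whether it beats the compiler's default unrolling is something only profiling can tell you.

```cpp
#include <cassert>

// HD is a hypothetical helper macro: under nvcc it marks functions as callable
// from both host and device code; under a plain C++ compiler it expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Unroll<N>::run(f) expands at compile time to f(0); f(1); ... f(N - 1);
// so no runtime loop (and no per-loop pragma) is involved.
template <int N>
struct Unroll {
    template <typename F>
    HD static void run(F& f) {
        Unroll<N - 1>::run(f);  // emit iterations 0 .. N-2 first
        f(N - 1);               // then iteration N-1
    }
};

// Base case: zero iterations, nothing to emit.
template <>
struct Unroll<0> {
    template <typename F>
    HD static void run(F&) {}
};

// Hypothetical loop body: adds the iteration index to a running sum.
struct Accumulate {
    int sum;
    HD void operator()(int i) { sum += i; }
};

// Host-side usage example: equivalent to summing i for i in 0..3.
int unrolled_sum4() {
    Accumulate body = {0};
    Unroll<4>::run(body);
    return body.sum;
}
```

Inside a kernel you would call Unroll<4>::run(body) directly with your own functor. The trip count has to be a compile-time constant, which is the same situation in which the compiler is likely to unroll the loop on its own.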