I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
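__device__ int popc(int a) {
    // Shift the unsigned bit pattern so the loop also terminates for negative a.
    unsigned int u = (unsigned int)a;
    int d = 0;
    while (u != 0) {
        if (u & 0x1) d++;
        u = u >> 1;
    }
    return d;
}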
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
    int d;
    // Operands are numbered from %0 (output d) and separated by a comma; popc takes a .b32 type.
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}
? Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?
Also, does the GPU have some extra magic instructions corresponding to these PTX instructions?
The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the compiler's output in PTX assembly, by adding the -ptx flag to nvcc. If the compiler generates the instruction for you, there is no need to hand-code the inline assembly yourself (or use an intrinsic).
Also, whether or not it makes a performance difference depends on whether the code path is used in a significant way, and on other factors such as the current performance limiter of your kernel (e.g. compute-bound or memory-bound).
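As a rough sketch of the "use an intrinsic" option: CUDA provides the built-in __popc, which you can call instead of writing inline PTX. The code below puts the plain loop and the intrinsic side by side so you can compare their results and, after compiling with nvcc -ptx, check the generated PTX for a popc.b32 instruction in each function. The names popc_loop, popc_intrinsic, PopcResult, popc_kernel and popc_on_device are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

// Plain C++ loop. Whether the compiler turns this into a single popc
// instruction is not guaranteed; inspect the PTX to find out.
__device__ int popc_loop(int a) {
    unsigned int u = static_cast<unsigned int>(a);  // unsigned so the shift ends at 0
    int d = 0;
    while (u != 0) {
        d += u & 1u;
        u >>= 1;
    }
    return d;
}

// CUDA built-in intrinsic: counts the set bits of a 32-bit value.
__device__ int popc_intrinsic(int a) {
    return __popc(static_cast<unsigned int>(a));
}

// Hypothetical result holder so both versions can be compared on the host.
struct PopcResult {
    int loop;
    int intrinsic;
};

__global__ void popc_kernel(int a, PopcResult* out) {
    out->loop = popc_loop(a);
    out->intrinsic = popc_intrinsic(a);
}

// Host helper (illustrative only; CUDA return codes are not checked here).
PopcResult popc_on_device(int a) {
    PopcResult h_out = {-1, -1};
    PopcResult* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(PopcResult));
    popc_kernel<<<1, 1>>>(a, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(PopcResult), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling popc_on_device(0xF0) from host code should give the same count in both fields. To see what the compiler actually emitted, build the file with nvcc -ptx and search the output for popc.b32 inside popc_loop and popc_intrinsic.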