Is inline PTX more efficient than C/C++ code?


I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).

Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?

For example: "population count", or popc, means counting the one bits. So should I write:

__device__ int popc(int a) {
  unsigned int u = a;  // shift as unsigned so that negative inputs terminate
  int d = 0;
  while (u != 0) {
    if (u & 0x1)  d++;
    u = u >> 1;
  }
  return d;
}

for that functionality, or should I, rather, use:

__device__ int popc(int a) {
    int d;
    // Operands are numbered from %0 and separated by a comma; popc takes .b32 or .b64.
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}

Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?

Also, does the GPU have some extra magic instructions corresponding to PTX instructions?


There are 2 answers

Robert Crovella (best answer)

The compiler may recognize what you're doing and emit a specialized instruction for it, or it may not. The only way to know in the general case is to inspect the PTX assembly the compiler generates, by adding the -ptx flag to nvcc. If the compiler generates the instruction for you, there is no need to hand-code the inline assembly yourself (or use an intrinsic).
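As a rough sketch of that check, the file below wraps a loop-based bit count in a small kernel so that its PTX can be inspected. The file name popc_probe.cu, the macro PROBE_HD, and the names popc_loop and popc_probe are all made up for illustration, and whether the compiler actually turns the loop into a popc instruction depends on the compiler version and optimization settings.

```cpp
// popc_probe.cu (hypothetical file name)
//
// Generate PTX and search it for the instruction of interest:
//   nvcc -ptx popc_probe.cu -o popc_probe.ptx
//   grep popc popc_probe.ptx
//
// PROBE_HD is an illustrative macro so the helper also builds as plain C++.
#ifdef __CUDACC__
#define PROBE_HD __host__ __device__
#else
#define PROBE_HD
#endif

// Loop-based bit count on an unsigned value, so the right shift
// always reaches zero.
PROBE_HD inline int popc_loop(unsigned int a) {
    int d = 0;
    while (a != 0) {
        d += static_cast<int>(a & 1u);
        a >>= 1;
    }
    return d;
}

#ifdef __CUDACC__
// Minimal kernel that calls popc_loop so the loop shows up in the PTX.
__global__ void popc_probe(const unsigned int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = popc_loop(in[i]);
    }
}
#endif
```

PTX is an intermediate representation; the machine code that actually runs (SASS) can be dumped from a compiled binary with cuobjdump -sass if you want to see the final instructions.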

Also, whether or not it makes a performance difference in the general case depends on whether or not the code path is used in a significant way, and on other factors such as the current performance limiters of your kernel (e.g. compute-bound or memory-bound).

einpoklum

A few more points in addition to @RobertCrovella's answer:

  • Even if you do use PTX for something - that should happen rarely. Limit it to small functions of no more than a few PTX lines - which you can then re-use for multiple purposes as you see fit, with most of your coding being in C/C++.
  • An example of this principle is the set of intrinsics @njuffa mentioned, in (that's not an official copy of the file, I think). Please read it through to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
  • For your specific example - you do want the PTX over the first version; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding __popc intrinsic (again, as @njuffa has noted).
  • You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
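The points above can be sketched roughly as follows: one small wrapper that uses the __popc intrinsic in device code, and, for comparison, a one-instruction inline PTX helper of the kind that is worth keeping tiny and reusable. The names POPC_HD, popc_portable, popc and popc_ptx are hypothetical, chosen only for this illustration; the host fallback is there so the snippet also builds with a plain C++ compiler.

```cpp
// POPC_HD is an illustrative macro: it expands to the CUDA qualifiers
// under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define POPC_HD __host__ __device__
#else
#define POPC_HD
#endif

// Portable fallback: each iteration clears the lowest set bit.
POPC_HD inline int popc_portable(unsigned int x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}

// Preferred form: the __popc intrinsic in device code, no PTX needed.
POPC_HD inline int popc(unsigned int x) {
#ifdef __CUDA_ARCH__
    return __popc(x);
#else
    return popc_portable(x);
#endif
}

#ifdef __CUDACC__
// Device-only inline PTX variant, shown for comparison with the intrinsic.
__device__ inline int popc_ptx(unsigned int x) {
    int d;
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(x));
    return d;
}
#endif
```

Kernels would then call popc(value) like any other function, and the PTX stays confined to one short helper if it is needed at all.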