I am trying to atomically add a float value to a __half in CUDA on a compute capability 5.2 device. This architecture does support the __half data type and its conversion functions, but it does not include any arithmetic or atomic operations for halves, such as atomicAdd().
I created the following atomicAdd() function wrapper, with a special case for when half-precision atomic arithmetic is unsupported (full example code):
__device__ void atomic_add(__half* a, float b) {
#if __CUDA_ARCH__ >= 700 // compute capability 7.0+ supports fp16 atomicAdd
    atomicAdd(a, __float2half(b));
#else
    atomicAdd(&__half2float(*a), b); // Error: expression must be an lvalue
#endif
}
atomicAdd(&__half2float(*a), b); does not work, because __half2float(*a) is not an lvalue. I could obtain an lvalue by creating a copy:
float a_tmp = __half2float(*a);
atomicAdd(&a_tmp, b);
*a = __float2half(a_tmp);
But now the atomic function doesn't serve any purpose because I'm working on a copy of the value I actually want to modify atomically.
Is there another way that I haven't thought of in which I could perform this operation?
As it happens, compute capability 5.2 devices basically don't support 16-bit atomics of any type. There is some evidence of this in the programming guide, and furthermore, if you try to use a 16-bit (unsigned short) atomicCAS on an architecture less than cc7.0, you will get a compile error: it's not supported, although that's not obvious from the programming guide. (Yes, I have already filed an internal bug, 3845962, at NVIDIA to have the documentation improved in this respect.)

The programming guide does illustrate the general formula to do atomicCAS-based custom atomics, and we will use that recipe. However, the other "ingredient" is that we are going to have to realize this with a 32-bit atomic. Generally speaking, it is possible to use a larger atomic on a smaller data type; you just don't modify anything outside of the data region of interest.

But one of the requirements that comes out of this approach is that you must make sure the atomic access will be legal. This means you must allocate in units of 32 bits (for the 32-bit atomic) even though the type of interest, __half, is 16 bits.

With that proviso, the general methodology is the same as is already covered in the programming guide and in other SO questions.
The following is one possible approach:
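A sketch of that recipe might look like the following. It assumes the bit-reinterpretation helpers __half_as_ushort() and __ushort_as_half() from cuda_fp16.h are available in your toolkit, and, per the proviso above, that the __half lives in storage allocated in 32-bit units:

#include <cuda_fp16.h>

__device__ void atomic_add(__half* a, float b) {
#if __CUDA_ARCH__ >= 700 // cc7.0+ has a native fp16 atomicAdd
    atomicAdd(a, __float2half(b));
#else
    // Locate the aligned 32-bit word that contains the 16-bit value.
    unsigned int* base = reinterpret_cast<unsigned int*>(
        reinterpret_cast<size_t>(a) & ~static_cast<size_t>(3));
    const bool upper = (reinterpret_cast<size_t>(a) & 2) != 0; // which half of the word is ours
    unsigned int old = *base;
    unsigned int assumed;
    do {
        assumed = old;
        // Pull out our 16 bits, do the add in float, and repack them.
        unsigned short raw = upper ? (unsigned short)(assumed >> 16)
                                   : (unsigned short)(assumed & 0xffffu);
        float sum = __half2float(__ushort_as_half(raw)) + b;
        unsigned short upd = __half_as_ushort(__float2half(sum));
        unsigned int newval = upper ? ((assumed & 0x0000ffffu) | ((unsigned int)upd << 16))
                                    : ((assumed & 0xffff0000u) | (unsigned int)upd);
        // If another thread modified the word in the meantime, retry with its value.
        old = atomicCAS(base, assumed, newval);
    } while (assumed != old);
#endif
}

Only the packed 16-bit pattern is written back; the other half of the word is preserved, and the loop retries whenever any thread has changed either half of the word between the read and the atomicCAS.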
(With CUDA 11.4 at least, this methodology can work on devices all the way back to cc3.5.)
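For illustration, here is a hypothetical test harness (names and sizes are my own, and it assumes the atomic_add() sketch above is in the same file). The point is the allocation proviso: the buffer is a whole 32-bit word, so both 16-bit elements sit inside a word that the 32-bit atomicCAS may legally access:

#include <cstdio>
#include <cuda_fp16.h>

__global__ void accumulate(__half* out, float v) {
    // 256 threads contend on the two halves of a single 32-bit word.
    atomic_add(&out[threadIdx.x % 2], v);
}

int main() {
    __half* d;
    cudaMalloc(&d, sizeof(unsigned int));    // one full 32-bit word = two __half values
    cudaMemset(d, 0, sizeof(unsigned int));
    accumulate<<<1, 256>>>(d, 1.0f);
    __half h[2];
    cudaMemcpy(h, d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("%f %f\n", __half2float(h[0]), __half2float(h[1])); // expect 128 128
    cudaFree(d);
    return 0;
}

Compiling with something like nvcc -arch=sm_52 test.cu exercises the atomicCAS path; with -arch=sm_70 or newer, the native fp16 atomicAdd is used instead.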
FP16 has fairly limited range compared to FP32, so that is something to keep in mind when adding float quantities to __half values.
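As a rough illustration of that limited range and precision (my own example, not from the answer): __half has an 11-bit significand and a maximum finite value of 65504, so once the accumulated value passes 2048 the spacing between representable values is already 2, and small contributions start to vanish:

#include <cstdio>
#include <cuda_fp16.h>

int main() {
    // Above 2048 the fp16 spacing is 2, so adding 1.0f rounds back to the same value;
    // sums beyond ~65504 overflow to infinity.
    __half h = __float2half(2048.0f);
    h = __float2half(__half2float(h) + 1.0f);
    printf("%f\n", __half2float(h)); // prints 2048.000000
    return 0;
}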