I'm working on a CUDA program where the ALU is fully utilized (almost 100% compute throughput). The program does a lot of XOR operations, among others. Is it possible to offload the XOR to the floating-point engine? As far as I know, IMAD instructions are not executed in the ALU, but rather in the FPU. In other words, can we replace a XOR b with something like a*c + b (where c is some magic constant), or even with 2-3 IMAD (integer multiply-add) instructions?
UPDATE: In response to the comments, a and b are 32-bit integers.
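
To make the question concrete, here is a minimal sketch of the kind of substitution being asked about, written so that it compiles both as plain C++ and under nvcc. The helper names (mad_candidate, matches_xor, xor_via_add_and) and the HD macro are made up for illustration and are not from any library. The last helper uses the well-known identity a XOR b = a + b - 2*(a AND b) modulo 2^32; it still needs an AND, so it is shown only as a reference point, not as a solution. Whether the compiler actually emits IMAD for any of this, and which pipe executes it, depends on the architecture and would have to be checked in the SASS (for example with cuobjdump -sass).

```cpp
#include <cstdint>

// Lets the same helpers build as host-only C++ or as CUDA device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// The form from the question: a single multiply-add with a constant c.
// Arithmetic is on uint32_t, so it wraps modulo 2^32.
HD inline uint32_t mad_candidate(uint32_t a, uint32_t b, uint32_t c) {
    return a * c + b;
}

// True if the multiply-add candidate reproduces a ^ b for this input pair.
// For example, c = 1 matches for (a=1, b=0) but not for (a=1, b=1),
// where a*c + b = 2 while a ^ b = 0.
HD inline bool matches_xor(uint32_t a, uint32_t b, uint32_t c) {
    return mad_candidate(a, b, c) == (a ^ b);
}

// Reference identity: a ^ b == a + b - 2*(a & b) modulo 2^32.
// Note that this still contains a bitwise AND.
HD inline uint32_t xor_via_add_and(uint32_t a, uint32_t b) {
    return a + b - 2u * (a & b);
}
```

A helper such as matches_xor can be called over a set of test inputs to reject a proposed constant c quickly before looking at the generated code.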