I'm using AMD-APP (1214.3). My OpenCL code is as follows:
// W is an uint4 variable
uint4 T = (uint4)(1U, 2U, 3U, 4U);
T += W;
Alternatively, I have also tried using constant data, as follows:
// outside function scope
__constant uint4 X = (uint4)(1U, 2U, 3U, 4U);
// inside function
uint4 T = X;
T += W;
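For completeness, a minimal kernel that reproduces this pattern would look roughly like the sketch below (the kernel and argument names are illustrative, not from my actual code):
// program-scope constants must live in __constant address space (OpenCL 1.x)
__constant uint4 X = (uint4)(1U, 2U, 3U, 4U);
__kernel void add_const_vec(__global uint4 *data)
{
    size_t gid = get_global_id(0);
    uint4 W = data[gid];   // W is loaded from global memory here
    uint4 T = X;           // or: uint4 T = (uint4)(1U, 2U, 3U, 4U);
    T += W;
    data[gid] = T;
}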
However, after compilation I saw that the assembly code contains multiple addition instructions just to form the uint vector:
dcl_literal l16, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l19, 0x00000002, 0x00000002, 0x00000002, 0x00000002
dcl_literal l18, 0x00000003, 0x00000003, 0x00000003, 0x00000003
dcl_literal l17, 0x00000004, 0x00000004, 0x00000004, 0x00000004
mov r66, l16
iadd r66, r66.xyz0, l17.000x
iadd r66, r66.xy0w, l18.00x0
iadd r66, r66.x0zw, l19.0x00
iadd r75, r75, r66
So, how can I write the vector initialization in OpenCL so that it compiles to fewer instructions? For example, a single literal load followed by an iadd, like the following:
dcl_literal l16, 0x00000001, 0x00000002, 0x00000003, 0x00000004
move r66, l16
iadd r75, r75, r66
Thanks for your help.
What you see in that dump seems to be LLVM assembler. It is the output of the compiler front end, yet to be processed by the back end and translated into machine code. Since it is not the final version, in my opinion there is no way to judge how optimal this code is.
As a suggestion: such an intermediate representation may be used for better backward compatibility with legacy architectures, as it looks like VLIW instruction code.
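If you want to see what the back end actually emits, one option is to dump the compiled program binary and inspect it offline; on AMD's stack the returned blob generally contains the final ISA. A minimal host-side sketch (error handling omitted, single device assumed; requires <CL/cl.h>, <stdio.h>, <stdlib.h>):
// query the size of the binary for the (single) device
size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
// fetch the binary itself (CL_PROGRAM_BINARIES expects an array of pointers)
unsigned char *bin = (unsigned char *)malloc(bin_size);
unsigned char *bins[1] = { bin };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
// write it to disk for inspection with offline tools
FILE *f = fopen("kernel.bin", "wb");
fwrite(bin, 1, bin_size, f);
fclose(f);
free(bin);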
Returning to OpenCL performance: a single memory (I/O) operation takes so long that any effort put into small instruction-level optimizations like this is a waste of time. That's why GPGPU performance is usually bandwidth-bound.
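To put that in perspective, you can measure the effective bandwidth of the kernel launch with OpenCL event profiling and compare it against your device's peak; the memory traffic, not the few extra iadd instructions, will dominate. A sketch (names like queue, kernel and n are illustrative; the queue must be created with CL_QUEUE_PROFILING_ENABLE):
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);
// profiling timestamps are reported in nanoseconds
cl_ulong t0, t1;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
double seconds = (t1 - t0) * 1e-9;
double bytes   = (double)n * sizeof(cl_uint4) * 2; // one read + one write per uint4 element
printf("effective bandwidth: %.2f GB/s\n", bytes / seconds / 1e9);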