How should I declare a vector variable in OpenCL so that it fully utilizes the GPU's vector instructions?

I'm using AMD-APP (1214.3). My OpenCL code is as follows:

// W is an uint4 variable
uint4 T = (uint4)(1U, 2U, 3U, 4U);
T += W;

I also tried using constant data, as follows:

// outside function scope
__constant uint4 X = (uint4)(1U, 2U, 3U, 4U);
// inside function
uint4 T = X;
T += W;

However, after compilation, I saw that the assembly code contains multiple addition instructions just to form the uint vector:

dcl_literal l16, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l19, 0x00000002, 0x00000002, 0x00000002, 0x00000002
dcl_literal l18, 0x00000003, 0x00000003, 0x00000003, 0x00000003
dcl_literal l17, 0x00000004, 0x00000004, 0x00000004, 0x00000004
    mov r66, l16
    iadd r66, r66.xyz0, l17.000x
    iadd r66, r66.xy0w, l18.00x0
    iadd r66, r66.x0zw, l19.0x00
    iadd r75, r75, r66

So, how should I write the vector initialization in OpenCL so that it compiles to fewer instructions? For example, a single load followed by an iadd, like the following:

dcl_literal l16, 0x00000001, 0x00000002, 0x00000003, 0x00000004
    mov r66, l16
    iadd r75, r75, r66

Thanks for your help.

There is 1 answer

Roman Arzumanyan

What you see in

dcl_literal l16, 0x00000001, 0x00000001, 0x00000001, 0x00000001
...

appears to be an intermediate representation (it looks like AMD IL) rather than final machine code. It is the output of the compiler front end and has yet to be processed by the back end and translated into the GPU's instruction set. Since it is not the final version, in my opinion it gives no reliable measure of how optimal the resulting code is.

My guess is that this intermediate representation is kept for better backward compatibility with legacy architectures, since it resembles VLIW instruction code.

Coming back to OpenCL performance: a single I/O (memory) operation takes so long that effort put into small instruction-level optimizations like this is mostly wasted time. That is why GPGPU performance is usually bandwidth-bound.