OpenMP SIMD on Power8

754 views Asked by At

I'm wondering whether there is any compiler (gcc, xlc, etc.) on Power8 that supports OpenMP SIMD constructs on Power8? I tried with XL (13.1) but I couldn't compile successfully. Probably it doesn't support simd construct yet.

I could compile with gcc 4.9.1 (with these flags -fopenmp -fopenmp-simd and -O1). I put differences between 2 asm files.

Can I say that gcc 4.9 is able to generate altivec code? In order to optimize more, what am I supposed to do? (I tried with -O3, restrict treatment)

My code is very simple:

int *x, *y, *z;
x = (int*) malloc(n * sizeof(int));
y = (int*) malloc(n * sizeof(int));
z = (int*) malloc(n * sizeof(int));   

#pragma omp simd
for(i = 0; i < N; ++i)
  z[i] = a * x[i] + y[i];

And generated assembly is here

  .L7:
  lwz 9,124(31)
  extsw 9,9 
  std 9,104(31)
  lfd 0,104(31)
  stfd 0,104(31)
  ld 8,104(31)
  sldi 9,8,2
  ld 10,152(31)
  add 9,10,9
  lwz 10,124(31)
  extsw 10,10
  std 10,104(31)
  lfd 0,104(31)
  stfd 0,104(31)
  ld 7,104(31)
  sldi 10,7,2
  ld 8,136(31)
  add 10,8,10
  lwz 10,0(10)
  extsw 10,10
  lwz 8,132(31)
  mullw 10,8,10
  extsw 8,10
  lwz 10,124(31)
  extsw 10,10
  std 10,104(31)
  lfd 0,104(31)
  stfd 0,104(31)
  ld 7,104(31)
  sldi 10,7,2
  ld 7,144(31)
  add 10,7,10
  lwz 10,0(10)
  extsw 10,10
  add 10,8,10
  extsw 10,10
  stw 10,0(9)
  lwz 9,124(31)
  addi 9,9,1
  stw 9,124(31)

GCC with -O1 -fopenmp-simd

.L7:
lwz 9,108(31)
mtvsrwa 0,9
mfvsrd 8,0
sldi 9,8,2
ld 10,136(31)
add 9,10,9
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 8,120(31)
add 10,8,10
lwz 10,0(10)
extsw 10,10
lwz 8,116(31)
mullw 10,8,10
extsw 8,10
lwz 10,108(31)
mtvsrwa 0,10
mfvsrd 7,0
sldi 10,7,2
ld 7,128(31)
add 10,7,10
lwz 10,0(10)
extsw 10,10
add 10,8,10
extsw 10,10
stw 10,0(9)
lwz 9,108(31)
addi 9,9,1
stw 9,108(31)

In order to clarify and understand details, I have one more application which is n^2 nbody application. This time my question is related with these compilers (gcc 4.9 and XL 13.1 ) and architectures (Intel and Power).

I put all the codes into gist https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-input-c ( full version of input code input.c )

  1. Power8 & XLC - It says "was not SIMD vectorized because it contains function calls. (there is sqrtf)". It's reasonable. But in the asm code I can see xsnmsubmdp is it normal? (the assembly: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-xlc-noinnersimd-asm)
  2. Power8 & gcc I tried to compile it in 2 ways (with omp simd construct and without). It changed my asm code, is it normal? (According to OpenMP, the code should not contain function call) (Assembilies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-power8-gcc-innersimd-asm)
  3. i74820K & gcc I did a same test with omp simd and without it. The output codes are different as well. Does FMA effect this code block ? (Assembilies: https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-noinnersimd-asm & https://gist.github.com/grypp/8b9f0f0f98af78f4223e#file-i74820k-gcc-innersimd-asm)

Thanks in advance

2

There are 2 answers

2
Denis Palmeiro On BEST ANSWER

The XL compiler on POWER Linux currently only supports a subset of the OpenMP 4.0 features. The SIMD construct feature is not supported at the moment, so the compiler will not recognize the construct in your source code.

However, if vectorization is what you're looking for then the good news is that the XL compiler should already automatically vectorize your code as long as you use at least the following optimization options

-O3 -qhot -qarch=pwr8 -qtune=pwr8

These options will enable high-order loop transformations along with POWER8 specific optimizations, including loop auto-vectorization for your loop.

Afterwards, you should see some VMX & VSX instructions in the generated assembly code similar to the following:

 188:   19 2e 80 7c     lxvw4x  vs36,0,r5
 18c:   84 09 a6 10     vslw    v5,v6,v1
 190:   10 00 e7 38     addi    r7,r7,16
 194:   10 00 a5 38     addi    r5,r5,16
 198:   40 28 63 10     vadduhm v3,v3,v5
 19c:   80 20 63 10     vadduwm v3,v3,v4
 1a0:   19 4f 66 7c     stxvw4x vs35,r6,r9
 1a4:   14 02 86 41     beq     cr1,3b8 <foo+0x3b8>
 1a8:   10 00 20 39     li      r9,16
 1ac:   19 4e 27 7d     lxvw4x  vs41,r7,r9
 1b0:   19 3e a0 7c     lxvw4x  vs37,0,r7

By the way, you can also get an optimization report from the XL compilers by using the -qreport option. This will explain which loops were vectorized and which loops were not and for what reason. e.g.

1586-542 (I) Loop (loop index 1 with nest-level 0 and iteration count 100) at test.c was SIMD vectorized.

or

1586-549 (I) Loop (loop index 2) at test.c was not SIMD vectorized because a data dependence prevents SIMD vectorization.

Hope this helps!

6
Hristo Iliev On

I don't have access to a Power-based machine right now, but some experimentation with the AST dumper on x86 shows that GCC 4.9.2 starts producing SIMD code only once the level of optimisation reaches O1, i.e. the following options should do the trick:

-fopenmp-simd -O1

The same is true for GCC 5.1.0.

Also note that the vectoriser applies a cost model that might prevent it from actually producing vectorised code in some cases. See -fsimd-cost-model and similar options here on how to override that behaviour.