Do intrinsics on Intel Xeon Phi give better performance than auto-vectorization?


Intel Xeon Phi provides the "IMCI" instruction set.
I used it to compute "c = a*b", like this:

float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);
    __m512 y_1Vec = _mm512_load_ps(y+i);

    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_pd(z+i,ans);

}

I tested its performance. When N is 1048576, it takes 0.083317 s.
I want to compare the performance with auto-vectorization,
so the other version of the code looks like this:

_Cilk_for(size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

This version takes 0.025475 s (but sometimes 0.002285 s or less, and I don't know why).
If I change the _Cilk_for to #pragma omp parallel for, the performance gets worse.

So if that is the case, why do we need to use intrinsics?
Did I make any mistakes anywhere?
Can someone give me some good suggestions for optimizing the code?

2

There are 2 answers

3
Arch D. Robison

The measurements don't mean much, because of various mistakes.

  • The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
  • The code uses _mm512_store_... on a possibly unaligned location z+i, which may cause a segmentation fault. Declaring z with __declspec(align(64)) fixes this.
  • The arrays x and y are not initialized. That risks introducing random numbers of denormal values, which might impact performance. (I'm not sure if this is an issue for Intel Xeon Phi).
  • There's no evidence that z is used, hence the optimizer might remove the calculation. I think it is not the case here, but it's a risk with trivial benchmarks like this. Also, allocating a large array on the stack risks stack overflow.
  • A single run of the examples is probably a poor benchmark, because the time is probably dominated by fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there is only about 1048576/120/16 = ~546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work. That might account for why the _Cilk_for outruns OpenMP. In OpenMP, all the threads must take part in a fork/join for a parallel region to finish.

If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
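For concreteness, here is a rough sketch of the intrinsics version with the mistakes above corrected (the REPS timing loop and the checksum are illustrative additions, not part of the original answer; on the MIC toolchain the header and build flags may differ):

#include <immintrin.h>   /* AVX-512 / IMCI intrinsics; header may differ on KNC toolchains */
#include <stdio.h>
#include <stdlib.h>

#define N         1048576
#define ALIGNMENT 64
#define REPS      100    /* repeat the kernel so fork/join overhead is amortized */

int main(void)
{
    /* All three arrays heap-allocated and 64-byte aligned. */
    float *x = (float*)_mm_malloc(N * sizeof(float), ALIGNMENT);
    float *y = (float*)_mm_malloc(N * sizeof(float), ALIGNMENT);
    float *z = (float*)_mm_malloc(N * sizeof(float), ALIGNMENT);

    /* Initialize the inputs so there are no accidental denormals. */
    for (size_t i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    for (int r = 0; r < REPS; ++r) {
        _Cilk_for (size_t i = 0; i < N; i += 16) {
            __m512 xv = _mm512_load_ps(x + i);
            __m512 yv = _mm512_load_ps(y + i);
            _mm512_store_ps(z + i, _mm512_mul_ps(xv, yv));   /* _ps, not _pd */
        }
    }

    /* Consume z so the optimizer cannot remove the computation. */
    float sum = 0.0f;
    for (size_t i = 0; i < N; ++i) sum += z[i];
    printf("checksum = %f\n", sum);

    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}

Timing several repetitions and taking the minimum also avoids the single-run pitfall described in the last bullet above.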

Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
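The exact IMCI sequence is not reproduced here, but as a rough sketch of the idea, a 16-element inclusive prefix sum can be written as a Hillis-Steele scan (shift by 1, 2, 4, 8 lanes and add). The version below uses AVX-512 intrinsics and _mm512_alignr_epi32 for the lane shifts; these are illustrative choices, not the original MIC code, and it takes 8 vector instructions rather than 6:

#include <immintrin.h>
#include <stdio.h>

/* Inclusive prefix sum of 16 floats: out[j] = in[0] + ... + in[j]. */
static inline __m512 prefix_sum_ps(__m512 v)
{
    const __m512i zero = _mm512_setzero_si512();
    __m512 t;

    /* Each step shifts v up by k lanes (zero-filling) and adds it back in. */
    t = _mm512_castsi512_ps(_mm512_alignr_epi32(_mm512_castps_si512(v), zero, 16 - 1));
    v = _mm512_add_ps(v, t);
    t = _mm512_castsi512_ps(_mm512_alignr_epi32(_mm512_castps_si512(v), zero, 16 - 2));
    v = _mm512_add_ps(v, t);
    t = _mm512_castsi512_ps(_mm512_alignr_epi32(_mm512_castps_si512(v), zero, 16 - 4));
    v = _mm512_add_ps(v, t);
    t = _mm512_castsi512_ps(_mm512_alignr_epi32(_mm512_castps_si512(v), zero, 16 - 8));
    v = _mm512_add_ps(v, t);
    return v;
}

int main(void)
{
    float out[16] __attribute__((aligned(64)));
    _mm512_store_ps(out, prefix_sum_ps(_mm512_set1_ps(1.0f)));
    for (int j = 0; j < 16; ++j) printf("%g ", out[j]);   /* prints 1 2 3 ... 16 */
    printf("\n");
    return 0;
}

This kind of cross-lane data movement is the sort of pattern that is hard to express as a plain parallel/simd loop, which is the point of the example.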

0
zam

My answer below equally applies to Intel Xeon and Intel Xeon Phi.

  1. An intrinsics-based solution is the most "powerful", just like assembly coding is.
    • On the negative side, an intrinsics-based solution is usually the least portable, is not a "productivity"-oriented approach, and is often not applicable to established "legacy" software codebases.
    • It also often requires the programmer to be a low-level or even micro-architecture expert.
  2. However, there are alternative approaches to intrinsics/assembly coding:
    • A) auto-vectorization (the compiler recognizes certain patterns and automatically generates vector code)
    • B) "explicit" or user-guided vectorization (the programmer gives the compiler some guidance about what to vectorize and under which conditions, etc.; explicit vectorization usually implies using keywords or pragmas; see the sketch after this list)
    • C) using vector classes or other kinds of intrinsics wrapper libraries, or even very specialized compilers. In fact, 2.C is often as bad as intrinsics coding in terms of productivity and incremental updates to legacy code.
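As a minimal sketch of option B applied to the question's loop (both forms below are standard Cilk Plus / OpenMP 4.0 syntax supported by the compilers named below; the loop body itself is taken from the question):

/* B1: OpenMP 4.0 SIMD pragma: the programmer asserts the loop is safe to
 * vectorize, and the compiler generates the vector code. */
#pragma omp simd
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

/* B2: Cilk Plus array notation: an explicit, loop-free way to say the same. */
z[0:N] = x[0:N] * y[0:N];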

In your second code snippet you seem to use "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks" supported by all recent versions of the Intel Compiler and also by GCC 4.9. (I say you seem to use explicit vectorization because _Cilk_for was originally invented for multi-threading; however, the most recent versions of the Intel Compiler may automatically parallelize and vectorize the loop when cilk_for is used.)
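For completeness, if you want both the threading and the vectorization to be explicit in the source rather than left to the compiler, OpenMP 4.0 offers a combined construct (a sketch; the pragma is standard OpenMP 4.0, but whether it beats the _Cilk_for version here would have to be measured):

/* OpenMP 4.0 combined construct: distribute iterations across threads,
 * then vectorize each thread's chunk. */
#pragma omp parallel for simd
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];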