Why does Xeon Phi always get poor efficiency?


I ran a for loop 1,000,000,000 times on a Xeon E5 and on a Xeon Phi and measured the time to compare their efficiency. I was surprised by the results:

  • On E5 (1 Thread): 41.563 Sec
  • On E5 (24 Threads): 22.788 Sec
  • Offload on Xeon Phi (240 Threads): 45.649 Sec

Can anybody tell me why the efficiency is so bad? Is it related to the architecture, or to something else?

Why do I get such poor efficiency on the Xeon Phi? I do nothing in the for loop. If nothing is wrong with my Xeon Phi coprocessor, what kind of work is the Xeon Phi good at? Does the code have to be vectorized? If not, is there anything useful I can do with the Xeon Phi's threads?

3

There are 3 answers

3
Taylor Kidd On BEST ANSWER

The key is that you say, "I do nothing in the for loop." (Please correct me if I'm mistaken.)

Because of practical limits when the Xeon Phi was created, its cores are based upon a Pentium-generation design with various enhancements, such as dual issue, 4 threads per core, and the 512-bit vector engine. So if you are only running scalar code, it runs like a Pentium.

You need to run code that is both highly parallel and highly vectorizable. Even better if threads running on each core are able to share the core's pipeline without much contention, e.g. DGEMM, as well as take advantage of the cache structure.
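As a rough illustration of "highly parallel and highly vectorizable", here is a minimal sketch of a loop with independent iterations. The function name `saxpy` and the choice of an OpenMP pragma are assumptions made for illustration, not something taken from the question; the pragma only takes effect when the compiler's OpenMP support is enabled, and the loop runs serially otherwise.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Every iteration is independent of the others,
   so the loop can be split across threads and each thread's chunk can be
   processed with SIMD instructions. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Requests threading plus vectorization (OpenMP 4.0 or later).
       Without OpenMP enabled the pragma is ignored and the loop is serial. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A loop like this gives both the threads and the vector units something to do, unlike an empty loop body.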

By running a trivial benchmark, you are basically comparing the execution of code overhead on both your architectures (Xeon and Xeon Phi). And code overhead is typically scalar.

Here's an exaggerated illustration for those of us who are more visually inclined.

|<--Ovr-->|<--Work--------------->| repeat 10^6 times //Xeon Server

|<-----Ovr----->|<-Work->| repeat 10^6 times //Xeon Phi

Where "Ovr" is overhead, and "Work" is your highly threaded and vectorized workload.

If you have "Work" to do, then the Xeon Phi does better. If you remove the "Work", leaving only the overhead, the Xeon does better.
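The same picture can be written as a toy cost model. The per-repetition numbers below are hypothetical, picked only to mirror the proportions of the diagram; they are not measurements of any real hardware.

```c
/* Total time for `reps` repetitions of (overhead + work), arbitrary units. */
double total_time(long reps, double overhead, double work)
{
    return (double)reps * (overhead + work);
}

/* Hypothetical per-repetition costs: the Xeon has less overhead but takes
   longer on the threaded, vectorized work; the Xeon Phi is the reverse. */
static const double XEON_OVR = 1.0, XEON_WORK = 8.0;
static const double PHI_OVR  = 3.0, PHI_WORK  = 2.0;
```

With these assumed values and 10^6 repetitions, the Xeon totals 9,000,000 units against 5,000,000 for the Xeon Phi. Setting the work to zero leaves 1,000,000 against 3,000,000, so the ordering flips once only overhead remains.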

4
Computer architect On

Xeon Phi sucks. In moderately parallel applications, traditional Xeons trounce the Xeon Phi; in massively parallel applications, GPGPUs rule. The Xeon Phi is only marginally competitive when you can perfectly parallelize AND vectorize your application. If either one is not perfect, forget the Xeon Phi.

EDIT: Some examples where the Xeon Phi performs worse than either traditional Xeons or GPGPUs:

blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/

http://www.delaat.net/awards/2014-03-26-paper.pdf

https://verc.enes.org/ISENES2/documents/Talks/WS3HH/session-4-hpc-software-challenges-solutions-for-the-climate-community/markus-rampp-mic-experiences-at-mpg

0
Vahid Noormofidi On

First, you have to utilize the entire chip, i.e., utilize the SIMD units as well. Second, in order to make good use of the Xeon Phi, the pipeline must not sit idle, i.e., there always have to be enough instructions in the pipeline. In your benchmark no useful instructions are issued, so you basically measured the launch of an empty loop (which is likely optimized out by your compiler), and because of the CPU's higher clock rate it runs faster on the CPU.
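To make the point about the empty loop concrete, here is a minimal sketch contrasting a loop with no observable effect with one whose result is returned. The function names and the arithmetic in the second loop are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* No observable effect: an optimizing compiler is free to delete the loop,
   in which case a benchmark built on it times almost nothing. */
void empty_loop(uint64_t n)
{
    for (uint64_t i = 0; i < n; ++i) {
        /* nothing */
    }
}

/* The accumulated value is returned, so the result has to be produced.
   The compiler may still simplify the computation, so it is worth checking
   the generated code or the timings before trusting a benchmark. */
uint64_t checksum_loop(uint64_t n)
{
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; ++i)
        acc += i ^ (i >> 3);
    return acc;
}
```

Printing or otherwise using the returned checksum in the caller keeps the work observable from the outside.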

In addition, in my benchmarks I found that the Xeon Phi's performance is very sensitive to the length of the innermost loop (that runs on SIMD units).
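The answer does not say why the loop length matters. One factor that commonly plays a role, offered here as an assumption rather than as the answerer's explanation, is how the trip count divides into the 512-bit vector width: leftover elements that do not fill a whole vector need separate handling. The helper names below are hypothetical.

```c
#include <stddef.h>

enum { VECTOR_BITS = 512 };  /* vector width quoted in the accepted answer */

/* Number of elements of the given size that fit in one vector. */
size_t lanes(size_t elem_bytes)
{
    return VECTOR_BITS / 8 / elem_bytes;
}

/* Iterations that fill a whole vector, and the elements left over. */
size_t full_vectors(size_t n, size_t elem_bytes)
{
    return n / lanes(elem_bytes);
}

size_t remainder_elems(size_t n, size_t elem_bytes)
{
    return n % lanes(elem_bytes);
}
```

For 4-byte floats there are 16 lanes, so an inner loop of 100 elements gives 6 full vectors plus 4 leftover elements, while a loop of 96 elements divides evenly.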