I ran a for loop 1,000,000,000 times on a Xeon E5 and on a Xeon Phi and measured the time to compare their performance. I was surprised to get the following results:
- On E5 (1 Thread): 41.563 Sec
- On E5 (24 Threads): 22.788 Sec
- Offload on Xeon Phi (240 Threads): 45.649 Sec
Can anybody tell me why I get such poor performance? Is it the architecture, or something else?
Why is performance so poor on the Xeon Phi? I do nothing in the for loop. If there is nothing wrong with my Xeon Phi coprocessor, what kind of work is the Xeon Phi good at? Does it have to be vectorized? If not, is there anything I can do on the Xeon Phi that uses its threads to help me?
The key is that you say, "I do nothing in the for loop." (Please correct me if I'm mistaken.)
Because of practical limits when the Xeon Phi was created, its cores are based on a Pentium-generation design with various enhancements, such as dual issue, 4 threads per core, and a 512-bit vector engine. So if you are only running scalar code, it runs like a Pentium.
You need to run code that is both highly parallel and highly vectorizable. It is even better if the threads running on each core can share the core's pipeline without much contention (e.g. DGEMM) and take advantage of the cache structure.
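As a rough sketch of what "highly parallel and highly vectorizable" can look like, here is a small loop whose iterations are independent of one another. The function name saxpy and its parameters are hypothetical, chosen only for illustration; this is not the original poster's code. It assumes a compiler with OpenMP 4.0 support (e.g. built with an OpenMP flag); without OpenMP the pragma is ignored and the loop simply runs serially.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration: y[i] = a * x[i] + y[i]. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Every iteration touches a different element, so the iterations can be
       divided among threads and, within each thread, packed into vector
       lanes. If OpenMP is not enabled, the pragma is ignored and the loop
       runs serially with the same result. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

An empty loop body gives the threads and the vector unit nothing like this to work on, which is why it only exercises the scalar side of the core.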
By running a trivial benchmark, you are basically comparing how fast each of your architectures (Xeon and Xeon Phi) executes code overhead. And code overhead is typically scalar.
Here's an exaggerated illustration for those of us who are more visually inclined.
|<--Ovr-->|<--Work--------------->| repeat 10^6 times //Xeon Server
|<-----Ovr----->|<-Work->| repeat 10^6 times //Xeon Phi
Where "Ovr" is overhead, and "Work" is your highly threaded and vectorized workload.
If you have "Work" to do, then the Xeon Phi does better. If you remove the "Work", leaving only the overhead, the Xeon does better.
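The same trade-off can be put into a toy cost model. Everything here is an assumption made for illustration: the function names are hypothetical, the time units are made up, and the factors (twice the overhead, four times faster on the work) are not measurements of either machine.

```c
/* Toy cost model in made-up time units; none of these numbers are measurements. */

/* Time for one iteration: scalar overhead plus the parallel, vectorizable work.
   work_speedup is how much faster the machine gets through the "Work" part. */
double iteration_time(double overhead, double work, double work_speedup)
{
    return overhead + work / work_speedup;
}

/* Hypothetical Xeon: low scalar overhead, baseline speed on the work. */
double xeon_time(double work)
{
    return iteration_time(1.0, work, 1.0);
}

/* Hypothetical Xeon Phi: twice the scalar overhead, four times faster on the work. */
double phi_time(double work)
{
    return iteration_time(2.0, work, 4.0);
}
```

With work = 0 (the empty loop), this model gives 1 unit for the Xeon and 2 for the Xeon Phi, so the Xeon wins. With work = 8, it gives 9 units for the Xeon and 4 for the Xeon Phi, so the Xeon Phi wins.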