I am new to the Xeon Phi, so I am going through the manuals and trying to understand how to improve performance on the Phi using its vector registers.
Consider the short code at the end of this question, which calculates the area under the curve 4/(1+x^2) on [0,1]
using a Riemann sum. The analytic answer is pi = 3.14159....
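(For reference, the exact value comes from the integral of 4/(1+x^2) from 0 to 1, which is 4*arctan(1) = pi.)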
The code basically consists of two nearly identical chunks that use OpenMP
to calculate the answer with 4 threads. The only difference is that the second
chunk uses the vectorized reduction __sec_reduce_add()
to compute the Riemann
sum over the sub-interval of [0,1] assigned to each thread.
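For anyone who has not seen it before, __sec_reduce_add() is the Intel Cilk Plus array-notation reduction, where a[start:length] denotes an array section of length elements. A minimal standalone example (separate from my Phi code, compiled with icc) of what it computes:

#include <iostream>

int main()
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    // a[2:4] is the array section a[2], a[3], a[4], a[5];
    // __sec_reduce_add() returns the sum of its elements.
    double s = __sec_reduce_add( a[2:4] );

    std::cout << s << "\n"; // prints 18
    return 0;
}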
The timing for the first chunk of the code is 0.0866439 s,
and for the second (vectorized) chunk it is 0.0868771 s.
Why do both chunks yield nearly the same timing? I would have thought that using the vector registers would significantly improve the performance.
I compiled this with icc and the flags -mmic -vec-report3 -openmp.
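The full compile line looks like this (the source and output file names are just placeholders):

icc -mmic -vec-report3 -openmp -o pi_phi pi_phi.cpp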
[Note: I have wrapped the two sections in a for loop over the rpt
variable because the rpt=0 and rpt=1 iterations are "warm-up" passes and therefore show somewhat higher timings. The timings quoted above are from the rpt=3 iteration.]
#include <iostream>
#include <omp.h>

using namespace std;

int main(void)
{
    int num_steps = 2e8;
    double dx = 1.0/num_steps;
    double x = 0.;
    double* fn = new double[num_steps];

    // Initialize an array containing the function values
    for (int i = 0; i < num_steps; ++i)
    {
        fn[i] = 4.0*dx/(1.0 + x*x);
        x += dx;
    }

    for (size_t rpt = 0; rpt < 4; ++rpt)
    {
        double start = omp_get_wtime();
        double parallel_sum = 0.;

        #pragma omp parallel num_threads(4)
        {
            int threadIdx = omp_get_thread_num();
            int begin = threadIdx * num_steps/4; // integer index of left-end point of sub-interval
            int end = begin + num_steps/4;       // integer index of right-end point of sub-interval
            double dx_local = dx;                // (not used in the sum below)
            double temp = 0;
            double x = begin*dx;                 // (not used in the sum below)

            for (int i = begin; i < end; ++i)
            {
                temp += fn[i];
            }

            #pragma omp atomic
            parallel_sum += temp;
        }
        double end = omp_get_wtime();
        std::cout << "\nTime taken for the parallel computation: " << end-start << " seconds";

        //%%%%%%%%%%%%%%%%%%%%%%%%%

        start = omp_get_wtime();
        double parallel_vector_sum = 0.;

        #pragma omp parallel num_threads(4)
        {
            int threadIdx = omp_get_thread_num();
            int begin = threadIdx * num_steps/4; // integer index of left-end point of sub-interval
            int end = begin + num_steps/4;       // integer index of right-end point of sub-interval
            double dx_local = dx;                // (not used in the sum below)
            double temp = 0;
            double x = begin*dx;                 // (not used in the sum below)

            // fn[start:length] array section: sum the end-begin elements of this thread's sub-range
            temp = __sec_reduce_add( fn[begin:end-begin] );

            #pragma omp atomic
            parallel_vector_sum += temp;
        }
        end = omp_get_wtime();
        std::cout << "\nTime taken for the parallel vector computation: " << end-start << " seconds";
    } // end for rpt

    delete[] fn;
    return 0;
}