Why are the timings for the vectorized reduction of a simple Riemann-sum integral on Xeon Phi so bad?


I am new to the Xeon Phi, so I am going through the manuals and trying to understand how to improve performance on the Phi using the vector registers.

Consider the short code at the end of this question, which calculates the area under the curve 4/(1+x^2) on [0,1] using a Riemann sum. The analytic answer is pi = 3.14159....
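
For reference, the exact value and the left-endpoint Riemann sum that the code accumulates are the following (here N stands for num_steps and dx = 1/N, as in the code):

\int_0^1 \frac{4}{1+x^2}\,dx = 4\arctan(x)\Big|_0^1 = \pi \approx \sum_{i=0}^{N-1} \frac{4}{1+(i\,dx)^2}\,dx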

The code basically consists of two nearly identical chunks that use OpenMP to calculate the answer with 4 threads. The only difference is that the second chunk uses the vectorized reduction function __sec_reduce_add() to compute the Riemann sum over the sub-domain of [0,1] assigned to each thread.
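
(For anyone unfamiliar with the notation: __sec_reduce_add() takes a Cilk Plus array section written as a[start:length], where the second field is an element count, not an end index. The names a, start and length below are placeholders of my own, not identifiers from the code at the end.)

double s = __sec_reduce_add( a[start:length] ); // sums a[start] + a[start+1] + ... + a[start+length-1]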

The timing for the first chunk of the code is 0.0866439 s, and for the second (vectorized) chunk it is 0.0868771 s.

Why do these both yield nearly the same timings? I would have thought that using the vector registers would have significantly improved the performance.

I compiled this with the icc -mmic -vec-report3 -openmp flags.
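
(The full command line was along these lines; the source-file and output names are placeholders of mine, not part of the original command.)

icc -mmic -vec-report3 -openmp riemann_pi.cpp -o riemann_pi.mic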

[Note: I have put a for loop with the rpt variable around the two sections because rpt=0 and rpt=1 are "warm-up" iterations and so show somewhat higher timings. The timings given above are for the two sections at rpt=3.]

#include <iostream>
#include <omp.h>
using namespace std;

int main (void)
{
  int num_steps     = 2e8           ;
  double dx        = 1.0/num_steps ;
  double x         = 0.            ;
  double* fn = new double[num_steps];

  // Initialize an array containing function values
  for(int i=0 ; i<num_steps ;++i )
    {
      fn[i] = 4.0*dx/(1.0 + x*x);
      x += dx;
    }


for(size_t rpt=0 ; rpt<4 ; ++rpt)
{
  double start = omp_get_wtime();
  double parallel_sum = 0.;
  #pragma omp parallel num_threads(4)
  {
    int threadIdx = omp_get_thread_num();
    int begin = threadIdx * num_steps/4 ; //integer index of left-end  point of sub-interval
    int end   = begin + num_steps/4     ;// integer index of right-end point of sub-interval
    double dx_local = dx                ;
    double temp = 0                     ;
    double x    = begin*dx              ;

    for (int i = begin; i < end; ++i)
      {
        temp += fn[i];
      }
    #pragma omp atomic
    parallel_sum += temp;
  }
  double end   = omp_get_wtime();
  std::cout << "\nTime taken for the parallel computation: "    << end-start << " seconds";


  //%%%%%%%%%%%%%%%%%%%%%%%%%

  start = omp_get_wtime();
  double parallel_vector_sum = 0.;
  #pragma omp parallel num_threads(4)
  {
    int threadIdx = omp_get_thread_num();
    int begin = threadIdx * num_steps/4 ; //integer index of left-end  point of sub-interval
    int end   = begin + num_steps/4     ;// integer index of right-end point of sub-interval
    double dx_local = dx                ;
    double temp = 0                     ;
    double x    = begin*dx              ;

    temp = __sec_reduce_add( fn[begin:end-begin] ); // array section is fn[start:length]; each thread sums its end-begin elements
    #pragma omp atomic
      parallel_vector_sum += temp;
  }
  end   = omp_get_wtime();
  std::cout << "Time taken for the parallel vector computation: "    << end-start << " seconds"  ;


 }// end for rpt
  delete [] fn; // free the array of function values
  return 0;
}