I would like to know whether code hoisting (loop-invariant code motion) is performed by OpenCL compilers on a variety of platforms, including Nvidia, AMD, and Intel. I created a simple example, and it seems like this optimization does not happen. Since I'm still new to OpenCL, I don't know whether I'm testing it correctly. The example kernels just combine two matrices element-wise and add a constant to each entry. Here is the code:

//Enable double precision (assumes the device supports cl_khr_fp64; required on OpenCL 1.x)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

//Just some complicated, loop-invariant operations on the variable random_private
#define zero random_private[0]*random_private[1]*random_private[2]*random_private[3]*random_private[4]*random_private[5]*random_private[6]*random_private[7]*random_private[8]*random_private[9]
#define zero1 powr((double)zero*zero+zero,10)
#define zero2 zero1/(zero1+1)
#define zero3 zero2+zero2*zero2

//Test whether the compiler hoists the loop-invariant computation
//C = A - B + something
kernel void matrix_add1(global double *A, global double *B, global double *C, global uint *random) {
  uint rowNum=10000;
  uint colNum=100;
//Copy random into private memory so that hoisting is legal (otherwise another thread could, in principle, modify random while the loop executes, and hoisting could change the result)
  uint random_private[10]={random[0],random[1],random[2],random[3],random[4],random[5],random[6],random[7],random[8],random[9]};
  for(uint j=0;j<colNum;j++){
    for(uint i=0;i<rowNum;i++){
//zero3 is a macro that performs a very complicated, loop-invariant computation on random_private
      C[i+j*rowNum]=A[i+j*rowNum]-B[i+j*rowNum]+zero3;
    }
  }
}

//Manually hoist the loop-invariant code
kernel void matrix_add2(global double *A, global double *B, global double *C, global uint *random) {
  uint rowNum=10000;
  uint colNum=100;
  uint random_private[10]={random[0],random[1],random[2],random[3],random[4],random[5],random[6],random[7],random[8],random[9]};
//Compute the loop-invariant value once, outside the loops (double, so the result matches matrix_add1)
  double tmp=zero3;
  for(uint j=0;j<colNum;j++){
    for(uint i=0;i<rowNum;i++){
      C[i+j*rowNum]=A[i+j*rowNum]-B[i+j*rowNum]+tmp;
    }
  }
}
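
I haven't included my host code above; the timing loop looks roughly like the following simplified C sketch (untested as written, assuming the command queue and kernel are already created and the kernel arguments are set; the helper name time_kernel is just for illustration, error checking omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Times 20 back-to-back launches of "kernel" with a single work-item.
   Uses POSIX clock_gettime for wall-clock time. */
static void time_kernel(cl_command_queue queue, cl_kernel kernel, const char *name)
{
    size_t global_size = 1;                 /* just one work-item */
    struct timespec t0, t1;

    clFinish(queue);                        /* nothing pending before timing */
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int run = 0; run < 20; ++run)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);

    clFinish(queue);                        /* wait for all launches to finish */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%s: %.2f sec\n", name, sec);
}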

Each kernel runs 20 times with just one work-item; here are the results on my computer:

Nvidia GTX 1070:
matrix_add1: 28.46 sec
matrix_add2: 4.3 sec

AMD Ryzen 1600X:
matrix_add1: 5.78 sec
matrix_add2: 0.16 sec

The kernel matrix_add1 is much slower than matrix_add2. Did I make any mistake in this example? Or is there a third-party compiler that can perform this kind of optimization and generate the optimized intermediate code for us? Thanks!
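
For context, the only way I currently know to inspect what the driver's compiler generated is to dump the program binary via clGetProgramInfo (on Nvidia the binary is PTX text, so the hoisting can be checked by eye). A rough, untested sketch for a single-device program, with the helper name dump_binary chosen just for illustration and error checking omitted:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes the compiled binary of a built cl_program (one device) to a file. */
static void dump_binary(cl_program program, const char *path)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *buf = malloc(size);
    unsigned char *binaries[1] = { buf };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);

    FILE *f = fopen(path, "wb");
    fwrite(buf, 1, size, f);
    fclose(f);
    free(buf);
}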
