Better pipeline utilization causes lower performance in GPUs

99 views Asked by At

I'm developing a simple OpenCL kernel, which is only doing computation with no memory access at all. Here is the kind of kernel we are executing on the GPU:

__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
         const int XGL = get_global_id(0);
         const int XGRid = get_group_id(0);
         const int XGRnum = get_num_groups(0);
         const int XLSize = get_local_size(0);
         const int XLid = get_local_id(0);
         // Just a private variable
         float temp1 = 1.0;
         float temp2 = 1.0;
         float temp3 = 1.0;
         float temp4 = 1.0;
         float tempOut;
         float MF = (float) M;
         float NF = (float) N;
         float PF = (float) P;


         // Start of a new level of for loop
         long baseIndex1 = XGRid*XLSize*8+XLid;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         ...
         temp1 += temp1 * MF;
         tempOut = temp1 + temp2 + temp3 + temp4;
         GOut[XGRid*XLSize*8+XLid] = tempOut;
}

The total number of "FMA operations" is about 1024. Based on the kernel, every instruction require the previous instruction to be finished first, due to existence of data dependency. I have tried to optimize execution of above kernel, but utilizing more temp variables to increase the gap between data dependent operations like below:

__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
         const int XGL = get_global_id(0);
         const int XGRid = get_group_id(0);
         const int XGRnum = get_num_groups(0);
         const int XLSize = get_local_size(0);
         const int XLid = get_local_id(0);
         // Just a private variable
         float temp1 = 1.0;
         float temp2 = 1.0;
         float temp3 = 1.0;
         float temp4 = 1.0;
         float tempOut;
         float MF = (float) M;
         float NF = (float) N;
         float PF = (float) P; 

         // Start of a new level of for loop
         long baseIndex1 = XGRid*XLSize*8+XLid;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         ...
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         tempOut = temp1 + temp2 + temp3 + temp4;
         GOut[XGRid*XLSize*8+XLid] = tempOut;
}

Executing and calculating the total the total GFLOPs of both kernel (while the total number of operations for both are the same), the first kernel gives around 1186.17 GFLOPs and the second kernel gives around 600.58 GFLOPS, which is completely against my first assumption. Since I have rolled out memory access completely, I cannot come up with another explanation for this numbers.

So I'm wondering if anyone knows what's going on on the device level while these kernels are being executed.

0

There are 0 answers