Row-major and column-major order in OpenCL and PyOpenCL


I am new to OpenCL, so this may be a dumb question!

I ran into some issues while porting a C/OpenCL program to Python/PyOpenCL. In particular, I am a little confused about the use of column-major and row-major order.

For a matrix M[i,j] of dimensions Ni*Nj, the flat index of element (i,j) is:

  • Column-major order: i + Ni*j

  • Row-major order: j + Nj*i
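As a quick check of these two formulas, here is a minimal NumPy sketch. The sizes Ni = 2 and Nj = 3 are arbitrary illustrative values, and NumPy's `order="C"` / `order="F"` are used as stand-ins for row-major and column-major storage.

```python
import numpy as np

Ni, Nj = 2, 3  # illustrative sizes, not taken from the kernel below
M = np.arange(Ni * Nj).reshape(Ni, Nj)  # M[i, j] = j + Nj*i

row_major = M.ravel(order="C")  # element (i, j) is stored at j + Nj*i
col_major = M.ravel(order="F")  # element (i, j) is stored at i + Ni*j

# Verify both index formulas for every element.
for i in range(Ni):
    for j in range(Nj):
        assert row_major[j + Nj * i] == M[i, j]
        assert col_major[i + Ni * j] == M[i, j]

print(row_major)  # [0 1 2 3 4 5]
print(col_major)  # [0 3 1 4 2 5]
```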

Using this kernel with a local size of (3,3):

__kernel void major(__global double local_i_i_major,
                    __global double local_j_i_major,
                    __global double local_i_j_major,
                    __global double local_j_j_major)
{
int i = get_global_id(0) ;
int j = get_global_id(1) ;

int Ni = get_num_groups(0) * get_local_size(0) ;
int Nj = get_num_groups(1) * get_local_size(1) ;

int main_i_major = j + i*Nj ;
int main_j_major = i + j*Ni ;

local_i_i_major(main_i_major) = get_local_id(0) ;
local_j_i_major(main_i_major) = get_local_id(1) ;

local_i_j_major(main_j_major) = get_local_id(0) ;
local_j_j_major(main_j_major) = get_local_id(1) ;
}

We can then look at the distribution of the local indices.

With column-major order, we obtain:

|-------------------------- (i,j) ----------------------|
_________________________________________________________
| (0,0)  |  (1,0)  |  (2,0)  | (0,0)  |  (1,0)  |  (2,0) |
| (0,0)  |  (1,0)  |  (2,0)  | (0,0)  |  (1,0)  |  (2,0) |
| (0,0)  |  (1,0)  |  (2,0)  | (0,0)  |  (1,0)  |  (2,0) |
| (0,1)  |  (1,1)  |  (2,1)  | (0,1)  |  (1,1)  |   ...  |
| (0,1)  |  (1,1)  |  (2,1)  | (0,1)  |   ...   |   ...  |
| (0,1)  |  (1,1)  |  (2,1)  |   ...  |   ...   |   ...  |
_________________________________________________________

And with row-major order, we obtain:

|-------------------------- (i,j) ----------------------|
_________________________________________________________
| (0,0)  |  (0,1)  |  (0,2)  |  (0,0) |  (0,1)  |  ...  |
| (1,0)  |  (1,1)  |  (1,2)  |  (1,0) |   ...   |  ...  |
| (2,0)  |  (2,1)  |  (2,2)  |  (2,0) |   ...   |  ...  |
| (0,0)  |  (0,1)  |  (0,2)  |  (0,0) |  (0,1)  |  ...  |
| (1,0)  |  (1,1)  |  (1,2)  |  (1,0) |   ...   |  ...  |
| (2,0)  |  (2,1)  |  (2,2)  |  (2,0) |   ...   |  ...  |
_________________________________________________________

Of course, these distributions are different. In particular, I do not understand the local index distribution in the column-major case. Some work items seem to have the same id. Is that possible?

When I read the literature about C/OpenCL, column-major order is used most of the time. When I read Python/PyOpenCL examples, row-major order is used instead.

Given that both Python and C use row-major order, why does this difference exist?

Also, what about performance? Is it better to use column-major or row-major order?

Is it possible to change the way values are ordered in OpenCL?


There are 2 answers

1
Lubo Antonov On

You are confusing the idea of memory layout with workgroup dimensions. OpenCL defines an abstract subdivision of the work space in up to 3 dimensions. These dimensions don't have to correspond to any specific memory layout. The optimal memory layout depends on the specific algorithm that you are implementing. However, OpenCL doesn't do the mapping of work items to memory; you do that in your kernel through memory access operations.

The OpenCL driver will logically (since it is actually in parallel) iterate through the workgroup dimensions, but the order is not specified in the standard, since it depends on the architecture. Within a workgroup, all work items can be thought of as executing in parallel (although in reality, they might not be). But even they don't represent a particular memory layout: the local dimensions might be (16,1), for example, but you could be accessing a 4x4 area in memory.
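To make the (16,1) example concrete, here is a small host-side sketch in plain Python. The helper `tile_coords` and the constant `TILE` are hypothetical names introduced only for illustration; the mapping is one possible choice a kernel could make, not something OpenCL prescribes.

```python
TILE = 4               # assumed tile width: a 4x4 area in memory
local_size = (16, 1)   # logical work-group shape from the example above

def tile_coords(local_id0):
    """Map a 1-D local id (0..15) to (row, col) inside a 4x4 tile."""
    return divmod(local_id0, TILE)

coords = [tile_coords(l) for l in range(local_size[0])]

# Every work item lands on a distinct cell of the 4x4 tile.
assert len(set(coords)) == TILE * TILE

print(coords[:5])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]
```

In a kernel, the same arithmetic would be applied to `get_local_id(0)`; the logical shape of the work-group and the shape of the memory region it touches are chosen independently.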

The optimal mapping depends on the type of device (GPU/FPGA vs CPU), because of their different architectures.

To summarize, the two aspects - memory layout and logical dimensions (or domain decomposition) - can't be decided in a general case; they depend on the algorithm that you are implementing.

The problems that you are having with your particular kernel are because you are nonsensically mixing the logical indices and then using them as physical. If you step by hand through your code you will see why you get duplicate entries in your output.
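One way to step through the index arithmetic by hand is to emulate it on the host. The sketch below is only an emulation under assumptions: the global size (6, 3) is made up because the question does not state it, there is no global offset, and only `get_local_id(0)` is recorded. It suggests that the layout the host assumes when reading a buffer back matters as much as the index the kernel wrote to.

```python
import numpy as np

Ni, Nj = 6, 3   # assumed global size; not given in the question
LOCAL = 3       # local size (3, 3), as in the question

buf_i_major = np.empty(Ni * Nj, dtype=np.int64)  # written at j + i*Nj
buf_j_major = np.empty(Ni * Nj, dtype=np.int64)  # written at i + j*Ni

for i in range(Ni):          # i plays the role of get_global_id(0)
    for j in range(Nj):      # j plays the role of get_global_id(1)
        local_i = i % LOCAL  # get_local_id(0), assuming no global offset
        buf_i_major[j + i * Nj] = local_i
        buf_j_major[i + j * Ni] = local_i

as_rows = buf_i_major.reshape(Ni, Nj)             # row-major data, read as C order
misread = buf_j_major.reshape(Ni, Nj)             # column-major data, read as C order
as_cols = buf_j_major.reshape(Ni, Nj, order="F")  # column-major data, read as F order

# Reading each buffer with the layout it was written in gives the same matrix;
# reading the column-major buffer with NumPy's default order does not.
assert np.array_equal(as_rows, as_cols)
assert not np.array_equal(as_rows, misread)

print(misread)
```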

BTW, your code doesn't look like a real kernel: surely the parameters are supposed to be pointers, and you then access them with parentheses () rather than square brackets, I guess.

0
42n4 On

As far as I know, if you have column-major kernels from C/OpenCL (working on columns joined into one OpenCL buffer), then the only solution is to build a transposed Fortran-ordered matrix from the array whose rows hold the samples to analyze (created by rowflux.append(sample)):

columnflux = np.transpose(np.asarray(rowflux, dtype=np.float64, order='F'))
...
_knl = mavg_k.yourcolumnorientedkernel
_knl.set_scalar_arg_dtypes([None,np.uint32,np.uint32,None])
_knl(queue,(globalSize,),(maxWorkGroupSize,),columnfluxbuffer,w,h,outputbuffer)

Otherwise, you can write row-major kernels for NumPy matrices (PyOpenCL's cl.Buffer then creates one linear OpenCL buffer from the matrix rows joined one after another).
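The difference between the two layouts can be inspected on the host before any buffer is created. This is a minimal sketch with made-up sample values standing in for `rowflux`; it only uses NumPy and does not touch PyOpenCL.

```python
import numpy as np

# Made-up samples, one per row (stand-in for the rowflux list above).
rowflux = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

rows = np.asarray(rowflux, dtype=np.float64)             # C (row-major) order
cols = np.asarray(rowflux, dtype=np.float64, order="F")  # Fortran (column-major) order

# ravel(order="K") returns the elements in the order they sit in memory.
rows_in_memory = rows.ravel(order="K")
cols_in_memory = cols.ravel(order="K")

print(rows_in_memory)  # [1. 2. 3. 4. 5. 6.]
print(cols_in_memory)  # [1. 4. 2. 5. 3. 6.]

# Same logical matrix, different physical layout.
assert np.array_equal(rows, cols)
assert rows.flags["C_CONTIGUOUS"] and cols.flags["F_CONTIGUOUS"]
```

Both arrays compare equal element by element, so the choice of `order` only changes which elements end up adjacent in the memory a kernel would see.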