OpenCL and Tesla M1060

I'm using the Tesla M1060 for GPGPU computation. It has the following specs:

# of Tesla GPUs 1
# of Streaming Processor Cores (240 per processor)  240
Memory Interface (512-bit per GPU)  512-bit

When I use OpenCL, I can display the following board information:

available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30 
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
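
(For reference, these values come from clGetDeviceInfo; a minimal sketch of the queries, assuming device already holds a valid cl_device_id:)

cl_uint cu, dims;
size_t sizes[3];
cl_ulong gmem, lmem;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(dims), &dims, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(sizes), sizes, NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmem), &gmem, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, NULL);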

How can I relate the GPU card specifications to the OpenCL information above?

For example:

  • What does "Memory Interface" mean? Is it linked to a Work Item?
  • How can I relate the "240 cores" of the GPU to Work Groups/Items?
  • How should I map Work Groups to the hardware (i.e. what number of Work Groups should I use)?

Thanks

EDIT:

After reading the answers below, one thing is still unclear to me:

The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.

However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
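
For reference, here is how the two values are queried side by side (a minimal sketch; kernel and device are assumed to be already created):

size_t preferred_multiple;
cl_uint max_cu;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL); /* 32 here */
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(max_cu), &max_cu, NULL); /* 30 here */
/* The two numbers measure different things: the preferred multiple counts
   work-items inside one work-group (the warp size on NVIDIA hardware), while
   the compute-unit count says how many work-groups can execute concurrently. */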

In the OpenCL 1.1 API specification, it is written (p. 15):

Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit.

It seems that either something is inconsistent here, or that I haven't fully understood the difference between Work-Groups and Compute Units.

As previously stated, when I set BLOCK_SIZE to 32 (i.e. a work-group size of 32×32), the program fails with the following error:

Entry function uses too much shared data (0x4020 bytes, 0x4000 max).

The value 16 works.

Addendum

Here is my kernel signature:

// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
    #error "IEEE-754 double precision not supported by OpenCL implementation."
#endif

#define BLOCK_SIZE 16 // --> this is what defines the WG size to me

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
  void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
  __local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
  __local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
  // stuff that does matrix multiplication with __local
}

In the host code:

#define BLOCK_SIZE 16 
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
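
(Note: global_work_size is elided above. In OpenCL 1.1 the global size must be an exact multiple of the local size in each dimension, so something like the following hypothetical helper is typically used; round_up and the use of q as the matrix dimension are illustrative, not taken from the actual code:)

size_t round_up(size_t x, size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}
/* hypothetical: q is assumed to be the matrix dimension */
const size_t global_work_size[2] = { round_up(q, BLOCK_SIZE), round_up(q, BLOCK_SIZE) };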

There are 2 answers

Answer by mfa (accepted, score 9)

The memory interface doesn't mean anything to an OpenCL application. It is the width, in bits, of the bus the memory controller uses to read from and write to device memory (the GDDR5 chips on modern GPUs; the M1060 uses GDDR3). Peak global memory bandwidth is approximately: (interface width in bytes) × (effective memory clock). Since OpenCL is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing, though. wiki: Coalescing (computer science)
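
As a rough worked example (the ~800 MHz, i.e. ~1600 MT/s effective, GDDR3 clock is taken from the published M1060 specs, not from anything OpenCL reports):

double bytes_per_transfer = 512.0 / 8.0;   /* 512-bit bus = 64 bytes     */
double transfers_per_sec  = 1600e6;        /* effective GDDR3 rate       */
double peak = bytes_per_transfer * transfers_per_sec / 1e9;
/* peak ~= 102 GB/s, matching the card's advertised bandwidth */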

The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- e.g. private memory and local memory.

The 240 figure doesn't matter much to OpenCL either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores on this GPU architecture (240 / 30 = 8). If you query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. See: clGetKernelWorkGroupInfo

I have answered similar questions about work-group sizing; see here and here.

Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work-group sizes and eventually hard-code the optimal size.
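
A minimal sketch of such a sweep, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and the reqd_work_group_size attribute is removed from the kernel while benchmarking (the candidate list and the 480×480 global size are illustrative):

size_t candidates[] = { 8, 12, 16, 24 };   /* candidate BLOCK_SIZE values        */
size_t gws[2] = { 480, 480 };              /* must be a multiple of each candidate */
for (int i = 0; i < 4; ++i) {
    size_t lws[2] = { candidates[i], candidates[i] };
    cl_event evt;
    cl_ulong t0, t1;
    clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, gws, lws, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    printf("%zux%zu: %.3f ms\n", lws[0], lws[1], (double)(t1 - t0) * 1e-6);
    clReleaseEvent(evt);
}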

Answer by mfa (score 1)

Adding another answer to address your local memory issue.

Entry function uses too much shared data (0x4020 bytes, 0x4000 max)

Since you are allocating A_sub and B_sub, each of size 32*32*sizeof(double) = 8192 bytes, you run out of local memory. The device should allow you to allocate 16kB, or 0x4000 bytes, of local memory without an issue.

0x4020 is 32 bytes, or 4 doubles, more than your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug in your device or drivers preventing you from allocating the full 16kB, or 2) you are allocating the memory somewhere else in your kernel.
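
For reference, the arithmetic with BLOCK_SIZE=32:

2 tiles × 32 × 32 × sizeof(double) = 2 × 32 × 32 × 8 = 16384 bytes = 0x4000
reported usage: 0x4020 = 16416 bytes, i.e. 32 bytes (4 doubles) over the limit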

You will have to use a BLOCK_SIZE value smaller than 32 to work around this for now.

There's good news, though. If you only want your work-group size to be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, BLOCK_SIZE=16 already does this for you (16*16 = 256 = 32*8). To make better use of local memory, try BLOCK_SIZE=24 (24*24 = 576 = 32*18).
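
And the corresponding local-memory checks:

BLOCK_SIZE=16: 2 × 16 × 16 × 8 = 4096 bytes (0x1000) -- fits easily
BLOCK_SIZE=24: 2 × 24 × 24 × 8 = 9216 bytes (0x2400) -- still under 0x4000,
               with 24×24 = 576 work-items = 18 full multiples of 32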