OpenCL & Java - Weird Performance Results

I'm trying to look into using OpenCL to improve the performance of some Java code using JOCL. I've been going through the samples provided on their website and used them to throw together a quick program to compare its performance to running things normally. The results I'm getting are a bit unexpected, though, and I'm concerned I may be doing something wrong.

To start with, I'm using JOCL 0.1.9, as I have an NVIDIA card that won't support OpenCL/JOCL 2.0. My computer has an Intel Core i7 CPU, an Intel HD Graphics 530 card, and an NVIDIA Quadro M2000M.

The program I've written is based on the JOCL samples; it takes two arrays of numbers and multiplies them, placing the results into a third array. I use Java's nanoTime() method to roughly track Java's observed execution time.

import static org.jocl.CL.*;

import org.jocl.*;

public class PerformanceComparison {

    public static final int ARRAY_SIZE = 1000000;

    // OpenCL kernel code
    private static String programSource = "__kernel void " + "sampleKernel(__global const float *a,"
            + "             __global const float *b," + "             __global float *c)" + "{"
            + "    int gid = get_global_id(0);" + "    c[gid] = a[gid] * b[gid];" + "}";

    public static final void main(String[] args) {
        // build arrays
        float[] sourceA = new float[ARRAY_SIZE];
        float[] sourceB = new float[ARRAY_SIZE];
        float[] nvidiaResult = new float[ARRAY_SIZE];
        float[] intelCPUResult = new float[ARRAY_SIZE];
        float[] intelGPUResult = new float[ARRAY_SIZE];
        float[] javaResult = new float[ARRAY_SIZE];

        for (int i = 0; i < ARRAY_SIZE; i++) {
            sourceA[i] = i;
            sourceB[i] = i;
        }

        // get platforms
        cl_platform_id[] platforms = new cl_platform_id[2];
        clGetPlatformIDs(2, platforms, null);

        // I know what devices I have, so declare variables for each of them
        cl_context intelCPUContext = null;
        cl_context intelGPUContext = null;
        cl_context nvidiaContext = null;
        cl_device_id intelCPUDevice = null;
        cl_device_id intelGPUDevice = null;
        cl_device_id nvidiaDevice = null;

        // get all devices on all platforms
        for (int i = 0; i < 2; i++) {
            cl_platform_id platform = platforms[i];

            cl_context_properties properties = new cl_context_properties();
            properties.addProperty(CL_CONTEXT_PLATFORM, platform);

            int[] numDevices = new int[1];
            cl_device_id[] devices = new cl_device_id[2];

            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, devices, numDevices);

            // get devices and build contexts
            for (int j = 0; j < numDevices[0]; j++) {
                cl_device_id device = devices[j];

                cl_context context = clCreateContext(properties, 1, new cl_device_id[] { device }, null, null, null);

                long[] length = new long[1];
                byte[] buffer = new byte[2000];
                clGetDeviceInfo(device, CL_DEVICE_NAME, 2000, Pointer.to(buffer), length);

                String deviceName = new String(buffer, 0, (int) length[0] - 1);

                // save based on the device name
                if (deviceName.contains("Quadro")) {
                    nvidiaContext = context;
                    nvidiaDevice = device;
                }
                if (deviceName.contains("Core(TM)")) {
                    intelCPUContext = context;
                    intelCPUDevice = device;
                }
                if (deviceName.contains("HD Graphics")) {
                    intelGPUContext = context;
                    intelGPUDevice = device;
                }
            }
        }

        // multiply the arrays using Java and on each of the devices
        long jvmElapsed = runInJVM(sourceA, sourceB, javaResult);
        long intelCPUElapsed = runInJOCL(intelCPUContext, intelCPUDevice, sourceA, sourceB, intelCPUResult);
        long intelGPUElapsed = runInJOCL(intelGPUContext, intelGPUDevice, sourceA, sourceB, intelGPUResult);
        long nvidiaElapsed = runInJOCL(nvidiaContext, nvidiaDevice, sourceA, sourceB, nvidiaResult);

        // results
        System.out.println("Standard Java Runtime: " + jvmElapsed + " ns");
        System.out.println("Intel CPU Runtime: " + intelCPUElapsed + " ns");
        System.out.println("Intel GPU Runtime: " + intelGPUElapsed + " ns");
        System.out.println("NVIDIA GPU Runtime: " + nvidiaElapsed + " ns");
    }

    /**
     * The basic Java approach - loop through the arrays, and save their results into the third array
     * 
     * @param sourceA multiplicand
     * @param sourceB multiplier
     * @param result product
     * @return the (rough) execution time in nanoseconds
     */
    private static long runInJVM(float[] sourceA, float[] sourceB, float[] result) {
        long startTime = System.nanoTime();
        for (int i = 0; i < ARRAY_SIZE; i++) {
            result[i] = sourceA[i] * sourceB[i];
        }
        long endTime = System.nanoTime();
        return endTime - startTime;
    }

    /**
     * Run a more-or-less equivalent program in OpenCL on the specified device
     * 
     * @param context JOCL context
     * @param device JOCL device
     * @param sourceA multiplicand
     * @param sourceB multiplier
     * @param result product
     * @return the (rough) execution time in nanoseconds
     */
    private static long runInJOCL(cl_context context, cl_device_id device, float[] sourceA, float[] sourceB,
            float[] result) {
        // create command queue
        cl_command_queue commandQueue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, null);

        // allocate memory
        cl_mem memObjects[] = new cl_mem[3];
        memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * ARRAY_SIZE,
                Pointer.to(sourceA), null);
        memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * ARRAY_SIZE,
                Pointer.to(sourceB), null);
        memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * ARRAY_SIZE, null, null);

        // build program and set arguments
        cl_program program = clCreateProgramWithSource(context, 1, new String[] { programSource }, null, null);

        clBuildProgram(program, 0, null, null, null, null);

        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memObjects[1]));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memObjects[2]));

        long global_work_size[] = new long[]{ARRAY_SIZE};
        long local_work_size[] = new long[]{1};

        // Execute the kernel
        long startTime = System.nanoTime();
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
            ARRAY_SIZE * Sizeof.cl_float, Pointer.to(result), 0, null, null);
        long endTime = System.nanoTime();

        // Release kernel, program, and memory objects
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseMemObject(memObjects[2]);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);

        return endTime - startTime;
    }
}

The output of the program is:

Standard Java Runtime: 3662913 ns
Intel CPU Runtime: 27186 ns
Intel GPU Runtime: 9817 ns
NVIDIA GPU Runtime: 12400512 ns

There are two things about this that confuse me:

  1. Why does the program run so much faster on the CPU when using OpenCL? It's the same equipment the JVM would be using; I know Java is slow compared to lower-level languages like OpenCL, but I didn't think it was that slow.
  2. What is wrong with the NVIDIA card? I know their support of OpenCL is less-than-stellar given their CUDA framework, but I'd still expect it to at least be faster than doing things normally. As it is, the backup, "this-is-here-in-case-you-break-your-real-graphics-card," Intel GPU is running circles around it.

I'm worried that I'm doing something wrong, or at least missing something that will allow this to work to its full potential. Any pointers I could get would be very welcome.

P.S. - I know that since I have an NVIDIA card, CUDA would likely be the better/faster option for me; however in this case I'd prefer the flexibility of OpenCL.

Update: I was able to find one thing I'd done wrong: relying on Java to report the runtime was dumb. I wrote a new test using OpenCL's event profiling, and it's getting slightly more sensible results:

Code:

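import static org.jocl.CL.*;

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.jocl.*;

public class PerformanceComparisonTakeTwo {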

    //@formatter:off
    private static final String PROFILE_TEST = 
            "__kernel void " 
            + "sampleKernel(__global const float *a,"
            + "             __global const float *b,"
            + "             __global float *c,"
            + "             __global float *d,"
            + "             __global float *e,"
            + "             __global float *f)" 
            + "{"
            + "    int gid = get_global_id(0);" 
            + "    c[gid] = a[gid] + b[gid];"
            + "    d[gid] = a[gid] - b[gid];"
            + "    e[gid] = a[gid] * b[gid];"
            + "    f[gid] = a[gid] / b[gid];"
            + "}";
    //@formatter:on
    private static final int ARRAY_SIZE = 100000000;

    public static final void main(String[] args) {
        initialize();
    }

    public static void initialize() {
        // identify all platforms
        cl_platform_id[] platforms = getPlatforms();

        Map<cl_device_id, cl_platform_id> deviceMap = getDevices(platforms);

        performProfilingTest(deviceMap);
    }

    private static cl_platform_id[] getPlatforms() {
        int[] platformCount = new int[1];
        clGetPlatformIDs(0, null, platformCount);

        cl_platform_id[] platforms = new cl_platform_id[platformCount[0]];
        clGetPlatformIDs(platforms.length, platforms, platformCount);

        return platforms;
    }

    private static Map<cl_device_id, cl_platform_id> getDevices(cl_platform_id[] platforms) {
        Map<cl_device_id, cl_platform_id> deviceMap = new HashMap<>();

        for(int i = 0; i < platforms.length; i++) {
            int[] deviceCount = new int[1];

            clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, null, deviceCount);

            cl_device_id[] devices = new cl_device_id[deviceCount[0]];

            clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, devices.length, devices, null);

            for(int j = 0; j < devices.length; j++) {
                deviceMap.put(devices[j], platforms[i]);
            }
        }

        return deviceMap;
    }

    private static void performProfilingTest(Map<cl_device_id, cl_platform_id> deviceMap) {
        float[] sourceA = new float[ARRAY_SIZE];
        float[] sourceB = new float[ARRAY_SIZE];

        for(int i = 0; i < ARRAY_SIZE; i++) {
            sourceA[i] = i;
            sourceB[i] = i;
        }

        for(Entry<cl_device_id, cl_platform_id> devicePair : deviceMap.entrySet()) {
            cl_device_id device = devicePair.getKey();
            cl_platform_id platform = devicePair.getValue();

            cl_context_properties properties = new cl_context_properties();
            properties.addProperty(CL_CONTEXT_PLATFORM, platform);

            cl_context context = clCreateContext(properties, 1, new cl_device_id[] { device }, null, null, null);

            cl_command_queue commandQueue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE, null);

            cl_mem memObjects[] = new cl_mem[6];
            memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * ARRAY_SIZE,
                    Pointer.to(sourceA), null);

            memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * ARRAY_SIZE,
                    Pointer.to(sourceB), null);

            memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * ARRAY_SIZE, null, null);
            memObjects[3] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * ARRAY_SIZE, null, null);
            memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * ARRAY_SIZE, null, null);
            memObjects[5] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * ARRAY_SIZE, null, null);

            cl_program program = clCreateProgramWithSource(context, 1, new String[] { PROFILE_TEST }, null, null);

            clBuildProgram(program, 0, null, null, null, null);

            cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

            for(int i = 0; i < memObjects.length; i++) {
                clSetKernelArg(kernel, i, Sizeof.cl_mem, Pointer.to(memObjects[i]));
            }

            cl_event event = new cl_event();

            long global_work_size[] = new long[]{ARRAY_SIZE};
            long local_work_size[] = new long[]{1};

            long start = System.nanoTime();
            clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                    global_work_size, local_work_size, 0, null, event);

            clWaitForEvents(1, new cl_event[] {event});
            long end = System.nanoTime();

            System.out.println("Information for " + getDeviceInfoString(device, CL_DEVICE_NAME));
            System.out.println("\tGPU Runtime: " + getRuntime(event));
            System.out.println("\tJava Runtime: " + ((end - start) / 1e6) + " ms");

            clReleaseEvent(event);
            for(int i = 0; i < memObjects.length; i++) {
                clReleaseMemObject(memObjects[i]);
            }
            clReleaseKernel(kernel);
            clReleaseProgram(program);
            clReleaseCommandQueue(commandQueue);
            clReleaseContext(context);
        }

        float[] result1 = new float[ARRAY_SIZE];
        float[] result2 = new float[ARRAY_SIZE];
        float[] result3 = new float[ARRAY_SIZE];
        float[] result4 = new float[ARRAY_SIZE];

        long start = System.nanoTime();
        for(int i = 0; i < ARRAY_SIZE; i++) {
            result1[i] = sourceA[i] + sourceB[i];
            result2[i] = sourceA[i] - sourceB[i];
            result3[i] = sourceA[i] * sourceB[i];
            result4[i] = sourceA[i] / sourceB[i];
        }
        long end = System.nanoTime();

        System.out.println("JVM Benchmark: " + ((end - start) / 1e6) + " ms");
    }

    private static String getDeviceInfoString(cl_device_id device, int parameter) {
        long[] bufferLength = new long[1];
        clGetDeviceInfo(device, parameter, 0, null, bufferLength);

        byte[] buffer = new byte[(int) bufferLength[0]];
        clGetDeviceInfo(device, parameter, bufferLength[0], Pointer.to(buffer), null);

        return new String(buffer, 0, buffer.length - 1);
    }

    private static String getRuntime(cl_event event) {
        long[] start = new long[1];
        long[] end = new long[1];

        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, Sizeof.cl_ulong, Pointer.to(start), null);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, Sizeof.cl_ulong, Pointer.to(end), null);

        long nanos = end[0] - start[0];
        double millis = nanos / 1e6;
        return millis + " ms";
    }

}

Output:

Information for Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
    GPU Runtime: 639.986906 ms
    Java Runtime: 641.590764 ms
Information for Quadro M2000M
    GPU Runtime: 794.972 ms
    Java Runtime: 1191.357248 ms
Information for Intel(R) HD Graphics 530
    GPU Runtime: 1897.876624 ms
    Java Runtime: 2065.011125 ms
JVM Benchmark: 192.680669 ms

This seems to indicate that the more powerful NVIDIA card is in fact performing better than the Intel one, as I'd expected. But...

  1. Why is the CPU still faster?
  2. Why is normal Java suddenly so much faster?

Hersfold On

I'm still poking around and trying to understand this, but I'll start posting an actual answer here to benefit any other clueless newbies like me. Hopefully someone who is less clueless will come along soon to correct me on anything I'm wrong about, but at the least those other clueless newbies can see what I've worked through and learn from it.

As I noted in the edit to the question, part of the weird results came from relying on Java to tell me how quickly things ran. This isn't strictly wrong, I think, but I was misunderstanding the data. The Java runtime includes the time it takes to move everything to and from the GPU's memory, whereas OpenCL's profiling simply reports how long the kernel takes to run; after all, OpenCL doesn't really know or care what's calling it. Enabling OpenCL profiling and using the events to track the runtime helped clarify this for me. This also explains the very small gap between the two runtimes for the CPU: it wasn't actually switching devices, so there was no memory transfer taking place.

I also noticed that the code above has a serious flaw. When enqueueing the kernel command, CL.clEnqueueNDRangeKernel accepts nine arguments. The sixth argument is called "local_work_size"; it specifies how many work-items OpenCL puts in each "work group". The closest analog I can think of in Java is threads: more threads (usually) means more work can be done at once (up to a point). In the code above, I was doing what the sample had shown me to do, and told OpenCL to use a local work size of 1, so every work group contained a single work-item. My understanding is that this is precisely the WRONG thing to do in GPGPU; the whole point of using a GPU is that it can handle many more calculations at a time than a CPU can, and a GPU runs the work-items of a group side by side on the same hardware, so one-item groups leave most of that hardware idle. It appears the best approach here is simply to leave that sixth argument null; this instructs OpenCL to pick whatever work-group size it thinks is appropriate. You can specify a size yourself, but the maximum permitted value varies depending on your device (you can use CL.clGetDeviceInfo to get the CL_DEVICE_MAX_WORK_GROUP_SIZE attribute of your device to determine the absolute maximum, but it gets more complicated if you use more than one dimension).
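If you do want to pass an explicit local_work_size rather than null, the arithmetic looks roughly like the sketch below. It assumes the OpenCL 1.x rule that the global work size must be a whole multiple of the local work size, and it uses a local size of 256 purely as a hypothetical value (the real limit comes from CL_DEVICE_MAX_WORK_GROUP_SIZE). The class and method names are made up for illustration and are not part of JOCL.

```java
public class WorkSizes {

    /** Rounds globalSize up to the next multiple of localSize. */
    public static long roundUp(long globalSize, long localSize) {
        long remainder = globalSize % localSize;
        return remainder == 0 ? globalSize : globalSize + (localSize - remainder);
    }

    /** Number of work-groups launched once the global size has been padded. */
    public static long workGroupCount(long globalSize, long localSize) {
        return roundUp(globalSize, localSize) / localSize;
    }

    public static void main(String[] args) {
        // Hypothetical local size; it must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE.
        long localSize = 256;

        // The two array sizes used in the question.
        for (long n : new long[] { 1_000_000L, 100_000_000L }) {
            System.out.println(n + " items -> global size " + roundUp(n, localSize)
                    + ", " + workGroupCount(n, localSize) + " groups"
                    + " (vs " + workGroupCount(n, 1) + " groups at local size 1)");
        }
    }
}
```

When the padded global size is larger than the array (as it is for 1,000,000 elements with a local size of 256), the kernel would also need a bounds check such as `if (gid < n)` so the extra work-items do not write past the end of the buffers.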

Short version:

  1. OpenCL's profiling will give you better timing statistics than Java (however, using both will help show the overhead of moving between CPU and GPU)
  2. Don't hard-code a local_work_size of 1 when calling CL.clEnqueueNDRangeKernel; passing null lets OpenCL choose the work-group size itself
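On the plain-Java side of the comparison, one factor worth ruling out is JIT warm-up: a loop that is timed once, the first time it runs, may still be executing unoptimized code. This is a general JVM behaviour and an assumption here, not something measured in the tests above. The sketch below times the same element-wise multiply after a few untimed warm-up runs and reports the median of several timed runs; the class name, method names, and run counts are all hypothetical choices.

```java
import java.util.Arrays;

public class JvmTiming {

    /** Element-wise multiply, the same work as the runInJVM loop in the question. */
    public static void multiply(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * b[i];
        }
    }

    /** Runs the task 'warmups' times untimed, then returns the median of 'runs' timed runs in ns. */
    public static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = i;
        }

        // Single cold run, comparable to how the question timed the loop.
        long coldStart = System.nanoTime();
        multiply(a, b, out);
        long cold = System.nanoTime() - coldStart;

        // Median of 11 runs after 10 untimed warm-up runs (arbitrary counts).
        long warm = medianNanos(() -> multiply(a, b, out), 10, 11);

        System.out.println("Cold run:    " + cold + " ns");
        System.out.println("Warm median: " + warm + " ns");
    }
}
```

For anything more rigorous than a quick sanity check, a dedicated harness such as JMH is the usual recommendation.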

New results:

Information for Quadro M2000M
    GPU Runtime: 35.88192 ms
    Java Runtime: 438.165651 ms
Information for Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
    GPU Runtime: 166.278112 ms
    Java Runtime: 167.128259 ms
Information for Intel(R) HD Graphics 530
    GPU Runtime: 90.985728 ms
    Java Runtime: 239.230354 ms
JVM Benchmark: 177.824372 ms
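To make the gap between the two timers concrete, the numbers reported above can be split into kernel time (from OpenCL profiling) and everything else the Java timer saw. This is only arithmetic on the posted results; treating the difference as host-side overhead (enqueueing, waiting, and presumably moving the buffers to the device) is an interpretation, and the class name is made up for illustration.

```java
public class OverheadBreakdown {

    /** Host-side overhead: wall-clock time minus the kernel time reported by profiling. */
    public static double overheadMs(double javaMs, double kernelMs) {
        return javaMs - kernelMs;
    }

    public static void main(String[] args) {
        // Values copied from the "New results" output above.
        String[] names = { "Quadro M2000M", "Intel Core i7-6820HQ", "Intel HD Graphics 530" };
        double[] kernelMs = { 35.88192, 166.278112, 90.985728 };
        double[] javaMs = { 438.165651, 167.128259, 239.230354 };

        for (int i = 0; i < names.length; i++) {
            double overhead = overheadMs(javaMs[i], kernelMs[i]);
            double share = 100.0 * overhead / javaMs[i];
            System.out.printf("%-24s kernel %8.2f ms, overhead %8.2f ms (%.0f%% of wall time)%n",
                    names[i], kernelMs[i], overhead, share);
        }
    }
}
```

The differences work out to roughly 402 ms for the Quadro, 0.85 ms for the CPU, and 148 ms for the HD Graphics 530, so for this workload the NVIDIA card spends far more wall-clock time outside the kernel than inside it.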