Boost.Compute slower than plain CPU?

I just started playing with Boost.Compute. To see how much speedup it can bring us, I wrote a simple program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>   // std::accumulate
#include <cmath>     // std::sqrt
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                       device_vector.end(),
                       device_vector.begin(),
                       compute::sqrt<float>(), queue);

            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }
    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                host_vector.end(),
                host_vector.begin(),
                [](float v){ return std::sqrt(v); });

    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}

And here's the sample output on my machine (win7 64-bit):

====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms

My question is: why is the plain (non-OpenCL) version faster?

There are 3 answers

Answer by Kyle Lutz

As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).

To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first run, as it will be far slower because it includes the time to compile and store the kernels).
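
A minimal sketch of that measurement loop (my own, reusing device_vector and queue from the question's code; the number of runs is arbitrary):

const int runs = 10;
boost::chrono::nanoseconds total(0);
for (int i = 0; i <= runs; ++i) {
    auto start = boost::chrono::high_resolution_clock::now();
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), compute::sqrt<float>(), queue);
    queue.finish(); // wait for the device before stopping the clock
    auto elapsed = boost::chrono::high_resolution_clock::now() - start;
    if (i > 0) { // skip the first (warm-up) run, which includes kernel compilation
        total += boost::chrono::duration_cast<boost::chrono::nanoseconds>(elapsed);
    }
}
std::cout << "average: " << (total.count() / runs) / 1.0e6 << " ms\n";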

Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm, which performs both the transform and the reduction with a single kernel. The code would look like this:

float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;

You can also compile code using Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking with boost_filesystem). The kernels you use will then be stored in your file system and compiled only the very first time you run your application (the NVIDIA driver on Linux already does this by default).
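
If you prefer not to touch the compiler flags, the macro can also, I believe, be defined in source, as long as it appears before any Boost.Compute header is included; a sketch:

// Equivalent to passing -DBOOST_COMPUTE_USE_OFFLINE_CACHE on the command line.
// You still need to link against boost_filesystem (and boost_system).
#define BOOST_COMPUTE_USE_OFFLINE_CACHE
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm.hpp>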

Answer by Roman Arzumanyan

You're getting bad results because you're measuring time incorrectly.

An OpenCL device has its own time counters, which are unrelated to the host's counters. Every OpenCL command passes through four states whose timestamps can be queried (from the Khronos web site):

  1. CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
  2. CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
  3. CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
  4. CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.

Keep in mind that these timers are device-side. So, to measure kernel and command-queue performance, you can query these timers; in your case, the last two are the ones you need.

In your sample code, you're measuring host time, which includes the data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.

So, to learn the actual kernel performance, you either need to attach a cl_event to the kernel launch (I have no idea how to do that through boost::compute) and query that event for the profiling counters, or make your kernel large and complicated enough to hide all the overheads.
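
For reference, a minimal sketch of that event query using the raw OpenCL C API rather than Boost.Compute; it assumes context, device, kernel and global_size already exist, requires the queue to be created with CL_QUEUE_PROFILING_ENABLE, and omits error checking:

// The queue must be created with profiling enabled for these queries to work.
cl_int err = 0;
cl_command_queue profiled_queue =
    clCreateateCommandQueue == 0 ? NULL : NULL; /* placeholder removed */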

Answer by Skizz

I can see one possible reason for the big difference. Compare the CPU and GPU data flows:

CPU              GPU

                 copy data to GPU

                 set up compute code

calculate sqrt   calculate sqrt

sum              sum

                 copy data from GPU

Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVidia card is probably suffering from the extra data copying and the time spent setting up the GPU for the calculation.

You should try the same program with a much more complex operation: sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.

In your example, moving the sqrt into the accumulate call would be faster (one pass over memory instead of two).
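
A minimal sketch of that host-side fusion (my own, reusing host_vector from the question and accumulating in float rather than int):

// Fold the sqrt into the accumulation so the host data is traversed once.
float ans = std::accumulate(host_vector.begin(), host_vector.end(), 0.0f,
    [](float acc, float v) { return acc + std::sqrt(v); });
std::cout << "ans: " << ans << std::endl;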