I am trying to do the following:

- Declare and initialise a `compute::vector` of size `n`, with the elements being `1^2`, `2^2`, `3^2`, ..., `n^2`
- While the size of the vector is > 1:
  a. For `i = 0` to `size of vector - 2`: `n[i] += n[i+1]`
  b. Remove the last element from the vector
- Square-root the remaining element in the vector
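For concreteness, here is the same algorithm in plain single-threaded C++ (my own rendering of the pseudocode above; `reference_cpu` is a made-up name):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain-C++ rendering of the pseudocode above (illustrative only).
double reference_cpu(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = double(i + 1) * double(i + 1); // 1^2, 2^2, ..., n^2

    while (v.size() > 1) {
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            v[i] += v[i + 1];                 // n[i] += n[i+1]
        v.pop_back();                         // remove the last element
    }
    return std::sqrt(v.front());              // square-root the survivor
}
```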
I implemented this with Boost.Compute as follows:

```cpp
#include <boost/compute.hpp>

namespace compute = boost::compute;

BOOST_COMPUTE_FUNCTION(double, square, (double x),
{
    return x * x;
});

uint32_t cl_benchmark(const compute::device& gpu) {
    compute::context ctx(gpu);
    compute::command_queue queue(ctx, gpu);

    // Starts measuring time from this line
    compute::vector<double> v(size, ctx); // size is defined elsewhere
    compute::iota(v.begin(), v.end(), 1, queue);
    compute::transform( // b^2 and c^2 in one function
        v.begin(), v.end(), v.begin(), square, queue
    );
    for (uint32_t temp_size = size; temp_size > 1; temp_size--) {
        compute::adjacent_difference( // b^2 + c^2
            v.begin(), v.end(), v.begin(), compute::plus<double>(), queue
        );
        // the pass leaves v[0] unchanged, so dropping the front element
        // keeps exactly the pairwise sums
        v.erase(v.begin(), queue);
    }
    compute::transform( // sqrt(a)
        v.begin(), v.end(), v.begin(), compute::sqrt<double>(), queue
    );
    // Stops measuring time (timing code and return value elided)
}
```
This works, but it turns out to be very slow:

- size 6000: 551 ms
- size 7024: 1870 ms
- size 8048: 8262 ms
I know that it takes time to compile the kernels and to copy data to the GPU, but these results still seem a bit off to me.
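One caveat I am aware of: the Boost.Compute calls may return before the device has finished, so the clock should only stop after the queue is drained, and I am not certain my measurement accounts for that. A sketch of a wrapper that does (`timed_ms` is a made-up helper, not a Boost.Compute API):

```cpp
#include <chrono>
#include <utility>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

// Made-up helper (not part of Boost.Compute): enqueue some work, then
// drain the queue before reading the clock, so the measurement includes
// device execution time and not just host-side enqueueing.
template <class F>
double timed_ms(compute::command_queue& queue, F&& work) {
    auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();  // enqueue kernels/copies
    queue.finish();           // block until the device has executed them
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Usage would be along the lines of `double ms = timed_ms(queue, [&]{ /* body of cl_benchmark */ });`.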
I implemented something similar with single-core CPU processing only (along the lines of the plain-C++ sketch above), and it can run this algorithm with a size of ~150k in about 2 s.
I suspect that the `for` loop is the bottleneck, because it seems to keep compiling new `adjacent_difference` kernels on every iteration. That is why I am wondering whether it can be parallelised.

To my understanding, if a new `adjacent_difference` kernel no longer has to be compiled for each iteration, the CPU-side bottleneck should be drastically reduced.
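For illustration, the kind of restructuring I have in mind for the loop is sketched below: keep the buffer in place and shrink the active range through an iterator offset, instead of calling `erase` every pass (which, as far as I understand, copies the whole remaining buffer each time). This reuses exactly the calls from the code above and is untested for performance:

```cpp
// Sketch only: same calls as in cl_benchmark, but without the
// per-iteration erase. Each adjacent-sum pass leaves its valid results
// starting one element later, so an iterator is advanced instead of
// shifting data on the device.
auto first = v.begin();
for (uint32_t len = size; len > 1; --len) {
    // in-place use of adjacent_difference, as in the original code:
    // out[0] = in[0], out[i] = in[i] + in[i-1]
    compute::adjacent_difference(first, first + len, first,
                                 compute::plus<double>(), queue);
    ++first; // the len - 1 pairwise sums now start here
}
compute::transform(first, first + 1, first,
                   compute::sqrt<double>(), queue); // sqrt of the survivor
```

Even with that change, each iteration still enqueues a kernel from the host, so if the real cost is per-iteration kernel compilation or launch overhead, this alone would not fix it; hence my question.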