Recently I've been looking at using TBB rather than boost.threads to speed up development. Generally the parallel_for works in most cases but I have a situation here that is a bit more complex.
There is an array of structs that needs computation that has been sorted according to a member variable. This is because the variables value relates to data that will be accessed during computation and grouping structs according to this will allow for cache coherency in a serial design.
#include <tbb/tbb.h>
#include <iostream>
struct thing
{
float value_one;
float value_two;
unsigned int sort_id;
};
class functor
{
thing* m_array;
public:
functor(thing* _array) : m_array(_array) {;}
void operator()(const tbb::blocked_range<unsigned int>& r) const
{
for(int i = r.begin(); i != r.end(); ++i)
{
//Doing a computation with array
m_array[i].value_one = m_array[i].value_two * m_array[i].value_two;
}
}
};
int main(int argc, char const *argv[])
{
unsigned int n = 10;
thing* array = new thing[n];
// Note the ordered id groups
array[0].sort_id = 1;
array[1].sort_id = 1;
array[2].sort_id = 1;
array[3].sort_id = 2;
array[4].sort_id = 3;
array[5].sort_id = 5;
array[6].sort_id = 5;
array[7].sort_id = 9;
array[8].sort_id = 9;
array[9].sort_id = 9;
// Do something parallel with array here...
// parallel_for(tbb::blocked_range<unsigned int>(0, n, 2), functor(array));
delete[] array;
return 0;
}
A simplified example is given above but in reality I'll most likely have an array of around 30-60 million elements.
I understand that parallel_for will divide the array into grouped ranges. However I would like each range to contain all the structs of a particular id. I don't mind if the range contains structs of multiple ids as long as they are sequential and contain all structs of both those ids.
int count = 0;
thing** blocks = new thing*[7];
int* sizes = new int[7];
int current_id = 0;
for(unsigned int i = 0; i < n; ++i)
{
if(array[i].sort_id != current_id)
{
current_id = array[i].sort_id;
blocks[count] = &array[i];
sizes[count] = 1;
++count;
}
else
{
sizes[count - 1] += 1;
}
}
parallel_for(tbb::blocked_range<unsigned int>(0, count, 2), functor(blocks, sizes));
Should I somehow divide the array into smaller chunks pointed to by another array that is then parallelized (as in the code directly above), and if so, what would be an efficient way to do this or is the example given optimal? Is there an alternative to parallel_for (such as task_group) that would be better suited to this problem?
The question is still not quite clear to me since you are mixing the goal and possible approaches.
If you need to sort the array, there is
parallel_sort
If you need to build an index to the sorted array where the
sort_id
given as a key mapped to the index where the given group of elements resides in the main array, useconcurrent_unordered_map
to store the groups (if there are big amount of groups) and useparallel_for
to build it up.If number of groups is less than about hundreds, you can use
std::map
orstd::unordered_map
and useparallel_reduce
in order to construct partial maps and merge them to the final state.And finally, when you'll have the right data structure with groups, you can use
parallel_for
across the groups as you want.P.S. this:
sounds really weird to me. There is a user functor (or C++11 lambda) which can be invoked in parallel in order to process different ranges
[begin;end)
. If you call the functor a 'task', it's ok but do not mix it withtbb::task
.