I've decided to write an algorithm to utilize parallel aggregation. Here is the single threaded code that I want to transform.
vector<vector<double>> sum;
for (const auto* fold : _obj.GetFolds())
     sum.push_back(move(vector<double>(fold->GetSize())));
for (int index : sequence)
{
     vector<vector<double>> values = Calculate(vec1[index], vec2[index]);
     for (int i = 0; i < sum.size(); i++)
     {
          for (int j = 0; j < sum[i].size(); j++)
               sum[i][j] += values[i][j];
     }
}
I looked at the MSDN page http://msdn.microsoft.com/en-us/library/gg663540.aspx which covers parallel_for with combinable, and http://msdn.microsoft.com/en-us/library/dd470426.aspx#map_reduce_example showing parallel_transform with parallel_reduce, but they are simple examples with only a counter.
vector<int> sequence = ...
  combinable<int> count([]() { return 0; });     
  parallel_for_each(sequence.cbegin(), sequence.cend(), 
    [&count](int i)
    {
        count.local() += IsPrime(i) ? 1 : 0;
    });
  return count.combine(plus<int>());
I'm having a difficult time finding examples where I would aggregate with a parallel loop the vector<vector<double>> sum as outlined above.
Also, I'm looking for advice on whether to use parallel_for and combinable or parallel_transform with parallel_reduce? The first link above states:
The parallel_reduce function is usually the recommended approach whenever you need to apply the Parallel Aggregation pattern within applications that use PPL. Its declarative nature makes it less prone to error than other approaches, and its performance on multicore computers is competitive with them. Implementing parallel aggregation with parallel_reduce doesn't require adding locks in your code. Instead, all the synchronization occurs internally. Of course, if parallel_reduce doesn't meet your needs or if you prefer a less declarative style of coding, you can also use the combinable class with parallel_for or parallel_for_each to implement the parallel aggregation.
You should be aware that parallel_for and parallel_for_each add overhead due to their support of features such as cancellation and dynamic range stealing. Also, a call to the combinable::local() method inside of a parallel loop adds the cost of a hash table lookup to each iteration of the loop. In general, use parallel aggregation to increase performance when iterations perform complex computations.
Thanks.