I have two functions that use parallel loops, but they are slower than their sequential counterparts. The first one:
static int[] sieveWithLogs(long n, long start, int intervalSize, List<int> factorBase, List<int> factorBaseLogs)
{
    int[] result = new int[intervalSize];
    Parallel.For(0, factorBase.Count, i =>
    {
        int p = factorBase[i];
        var roots = Tools.ShanksTonelliLONG(n, p);
        long r1 = roots.Item2;
        long r2 = roots.Item3;
        // first positions in the interval that correspond to each root
        long root1 = r1 + p * ((start - r1) / p + 1);
        long root2 = r2 + p * ((start - r2) / p + 1);
        long index1 = root1 - start;
        long index2 = root2 - start;
        while (index1 < intervalSize)
        {
            result[index1] += factorBaseLogs[i];
            index1 += p;
        }
        while (index2 < intervalSize)
        {
            result[index2] += factorBaseLogs[i];
            index2 += p;
        }
    });
    return result;
}
(Description: factorBase consists of prime numbers; the function creates an int array of the given size and, for each prime p, adds the logarithm of p to every entry whose index has the form r + p*k.)
The second one is just column addition in a matrix:
static void AddColumn(int i, int j, List<bool[]> matrix)
{
    Parallel.For(0, matrix.Count, k =>
    {
        matrix[k][j] ^= matrix[k][i];
    });
}
The first one is slightly slower than the sequential version, while the second is about twice as slow. Am I using the parallel loop incorrectly, or can these operations simply not be parallelized effectively?
The first function is incorrect. You are reading and writing a shared data structure (the result array) from multiple threads without any synchronization, and this is a big no-no: different primes can hit the same index, so updates can be lost. And adding a lock to make it safe would likely add even more overhead, making the parallel version even slower.
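For illustration only, the cheapest fix would be to make each shared write atomic with Interlocked.Add (needs using System.Threading;); this sketch replaces the two while loops in sieveWithLogs, and it will almost certainly still be slow, because you pay for an atomic operation on every element:

// Sketch: atomic adds instead of the unsynchronized "result[index1] += ..." updates.
// Correct, but an atomic per element usually costs more than the addition it
// protects, so don't expect a speedup from this alone.
while (index1 < intervalSize)
{
    Interlocked.Add(ref result[index1], factorBaseLogs[i]);
    index1 += p;
}
while (index2 < intervalSize)
{
    Interlocked.Add(ref result[index2], factorBaseLogs[i]);
    index2 += p;
}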
The second function should be safe, since each iteration writes to a different row, so the modified data is not shared between threads.
However, each iteration does very little work, so the overhead of distributing the work to different threads is higher than the actual work itself. This is a question of "granularity": if the granularity is too fine, the overhead becomes significant; if it is too coarse, it becomes impossible to load-balance the tasks effectively. My guess is that both of your examples are much too fine-grained.
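To give a concrete idea of what coarser granularity looks like for the second function, here is a sketch (method name is mine) that hands each worker a contiguous range of rows via Partitioner.Create and runs a plain inner loop over that range; whether it actually beats the sequential loop still depends on how large the matrix is:

// Needs: using System.Collections.Concurrent; and using System.Threading.Tasks;
static void AddColumnChunked(int i, int j, List<bool[]> matrix)
{
    // Partition the row indices into contiguous ranges; each range is processed
    // by a cheap sequential loop, so the scheduling overhead is paid once per
    // chunk instead of once per row.
    Parallel.ForEach(Partitioner.Create(0, matrix.Count), range =>
    {
        for (int k = range.Item1; k < range.Item2; k++)
        {
            matrix[k][j] ^= matrix[k][i];
        }
    });
}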
My recommendation is to first check whether you can make the single-threaded variant faster. Do some profiling and/or benchmarking. See if any algorithmic improvements are possible. Consider other sources of overhead, like non-inlinable method calls, cache misses, etc.
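If you want numbers rather than guesses, a small benchmark harness helps. Here is a rough sketch using the BenchmarkDotNet package; the matrix size and column indices are made up for illustration:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class AddColumnBenchmarks
{
    private List<bool[]> matrix;

    [GlobalSetup]
    public void Setup()
    {
        // Build a random boolean matrix once, outside the measured code.
        var rng = new Random(42);
        matrix = new List<bool[]>();
        for (int k = 0; k < 4000; k++)
        {
            var row = new bool[4000];
            for (int c = 0; c < row.Length; c++)
                row[c] = rng.Next(2) == 1;
            matrix.Add(row);
        }
    }

    [Benchmark(Baseline = true)]
    public void Sequential()
    {
        for (int k = 0; k < matrix.Count; k++)
            matrix[k][1] ^= matrix[k][0];
    }

    [Benchmark]
    public void ParallelVersion()
    {
        Parallel.For(0, matrix.Count, k => { matrix[k][1] ^= matrix[k][0]; });
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<AddColumnBenchmarks>();
}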
Once you have done that, a fairly common pattern is an outer parallel loop with an inner regular loop, so that the granularity can be controlled. You can also use the Parallel.For overload with thread-local storage, so that each thread can increment its own "result" array without any locking, and in the "localFinally" delegate you sum all the local results into the final array.
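As a sketch of that last idea applied to your first function (keeping your Tools.ShanksTonelliLONG call and root arithmetic as-is), each worker accumulates into its own scratch array, and the arrays are merged once per worker in localFinally, so no locking happens inside the hot loops:

static int[] sieveWithLogsLocal(long n, long start, int intervalSize, List<int> factorBase, List<int> factorBaseLogs)
{
    int[] result = new int[intervalSize];
    object mergeLock = new object();

    Parallel.For(0, factorBase.Count,
        // localInit: every worker gets its own scratch array,
        // so the additions below need no synchronization at all.
        () => new int[intervalSize],
        (i, state, local) =>
        {
            int p = factorBase[i];
            var roots = Tools.ShanksTonelliLONG(n, p);
            long r1 = roots.Item2;
            long r2 = roots.Item3;
            long index1 = r1 + p * ((start - r1) / p + 1) - start;
            long index2 = r2 + p * ((start - r2) / p + 1) - start;
            while (index1 < intervalSize)
            {
                local[index1] += factorBaseLogs[i];
                index1 += p;
            }
            while (index2 < intervalSize)
            {
                local[index2] += factorBaseLogs[i];
                index2 += p;
            }
            return local;
        },
        // localFinally: merge each worker's scratch array into the shared
        // result; the lock is taken once per worker, not once per element.
        local =>
        {
            lock (mergeLock)
            {
                for (int k = 0; k < intervalSize; k++)
                    result[k] += local[k];
            }
        });

    return result;
}

Note that each worker allocates its own intervalSize-sized array, so this trades memory for the removal of contention; for small factor bases or intervals the sequential version may still win, which is why measuring first matters.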