Hello fellow programmers. I have already asked one question, but despite the really good answers I got, I couldn't fix my problem. I then took the time to refactor my code in a way that should improve its parallelization potential (fewer calculation batches, each with more calculation duty). Still, I can't get better performance than with serial processing.
I suspect the slow parallel processing is due to context switching, or maybe to "automatic" synchronization of shared objects. I hope you can help me understand what's going on.
Let me state my case: I'm making a program for scientific calculations. It does not depend on anything external, just on the input values I give it at the start.
The size of the problem can be measured by Ns (which is the name I use). It can be seen as the "resolution" of the solution, it is one of the user inputs, and it is usually of the order of 100. This way, I have several double arrays in my main class, such as double ys[Ns][N] or phiS[Ns][Nord][N], where N and Nord are other fixed magnitudes of the program.
In my program, I have to calculate several things for each one of the Ns points, and here comes the parallelization. Each point's calculation is independent, so I can divide them among different threads and hope it gets faster.
So, instead of having a single loop for (int i = 0; i < Ns; i++), I divided this calculation duty into Runnable batches, each one ranging over a smaller interval: for (int i = start; i < end; i++), where start and end are always between 0 and Ns. For example, if I'm on a dual-core PC, I make two batches, one with start = 0 and end = Ns/2, the other with start = Ns/2 and end = Ns. If I'm on a quad core, the second batch will have start = Ns/4 and end = Ns/2, and so on (assuming the division is exact in every case).
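Roughly, the splitting looks like this (a simplified sketch, not my exact code, assuming Ns divides evenly among the cores; Batch(start, end) stands for my Runnable class):

    // needs java.util.ArrayList / java.util.List
    // Split the range [0, Ns) into one contiguous chunk per core.
    List<Batch> makeBatches(int Ns) {
        int nCores = Runtime.getRuntime().availableProcessors();
        int chunk = Ns / nCores;
        List<Batch> batches = new ArrayList<>();
        for (int c = 0; c < nCores; c++) {
            int start = c * chunk;
            int end = (c == nCores - 1) ? Ns : start + chunk; // last chunk absorbs any remainder
            batches.add(new Batch(start, end));
        }
        return batches;
    }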
Each Batch, a class that implements Runnable, is stored in an ArrayList<Batch> and given to a FixedThreadPool whose size equals the number of cores. The pool executes the batches, and I wait for them to finish using a simple CountDown scheme.
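The setup is essentially the CountDownLatch pattern; a minimal sketch of what I mean (the lambda wrapper and the nCores/batches names are just for illustration):

    // needs java.util.concurrent.{Executors, ExecutorService, CountDownLatch}
    ExecutorService pool = Executors.newFixedThreadPool(nCores);
    CountDownLatch done = new CountDownLatch(batches.size());

    for (Batch b : batches) {
        pool.submit(() -> {
            try {
                b.run();            // the per-range calculation
            } finally {
                done.countDown();   // signal that this batch has finished
            }
        });
    }
    done.await();                   // blocks until every batch has counted down
                                    // (await() throws InterruptedException, so handle or declare it)

Since all of this sits inside an outer loop, the pool is created once and reused across iterations, so I shouldn't be paying thread-creation costs every time.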
Each of these batches needs to access the data in those arrays from the main class of the program, but the access pattern is such that each batch only reads from yS[start][] to yS[end][], so two batches will never try to read the same array element. I wonder whether Java still locks up yS, even though no batch is trying to access the same elements as the others.
I also wonder whether my problem is related to context-switching overhead, as each batch needs to deal with thousands of doubles, and whether the way the program is built can affect that.
Maybe I should find a way to pass each batch just the elements of the arrays that are relevant to it, but I don't know how to approach this. If there were pointers, I could build new arrays of just the desired elements with simple pointer operations and without reallocating anything. Is there a way to do such a thing in Java?
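If I understand Java arrays correctly, a double[Ns][N] is really an array of references to its row arrays, so maybe something like this sketch would hand each batch only its own rows without copying any of the doubles (slice is just an illustrative name):

    // needs java.util.Arrays
    // Copies only (end - start) row references, not the underlying double values.
    double[][] slice = Arrays.copyOfRange(yS, start, end);
    // slice[0] and yS[start] are the same row object, so no data is duplicated.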
Well, finally, just to mention: there is one part of the code that needs to be synchronized (it deals with other arrays), and it is already working fine. The calculation duties I've described above aren't the only thing my program does; they sit inside a loop, alternating with sequential processing parts, but they account for a really significant share of the total execution time.
So, to summarize, the question is: why am I not gaining from multithreading, when I was expecting to?
I've just run the plain serial and the multithreaded versions a couple of times and got 14500 ms for the serial run and 15651 ms for the multithreaded one, both on the same dual core. Another point to notice: in the serial run, each calculation duty (from 0 to Ns) takes around 1.1 to 4.5 ms. With dual threading, each batch (Ns/2 points) takes around 0.5 to 3 ms (measured from the top to the bottom of the run() method; the time of each calculation duty differs because of its own numerical convergence).
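For reference, those per-batch numbers come from something like this inside run() (a simplified sketch using System.nanoTime; the printf is only illustrative):

    public void run() {
        long t0 = System.nanoTime();
        // ... the calculation over [start, end) ...
        long t1 = System.nanoTime();
        System.out.printf("batch [%d,%d) took %.3f ms%n", start, end, (t1 - t0) / 1e6);
    }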
Thanks very much for your attention.
One possible issue you may be running into is threads thrashing over cache lines (false sharing). If different threads rapidly write to locations in the same cache line (for instance, nearby elements of the same array), the hardware has a high communication overhead in keeping the data consistent.
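As a hypothetical illustration (partial, t, and expensiveStep are made-up names, not your actual code), compare writing per-thread results straight into adjacent slots of a shared array with accumulating locally and publishing once:

    // False-sharing-prone: per-thread totals in adjacent slots of a shared array.
    // The slots sit on the same cache line, so every write by one thread
    // invalidates that line in the other cores' caches.
    double[] partial = new double[nThreads];
    // inside thread t's run():
    for (int i = start; i < end; i++) {
        partial[t] += expensiveStep(i);      // rapid writes -> cache-line ping-pong
    }

    // Friendlier: accumulate in a local variable, publish one result at the end.
    // inside thread t's run():
    double local = 0.0;
    for (int i = start; i < end; i++) {
        local += expensiveStep(i);           // stays in a register / this core's private cache
    }
    partial[t] = local;                      // a single shared write per batch

Padding the shared slots so each thread's data lives on its own cache line, or giving each thread its own result object, achieves the same effect.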