In my previous question, we discussed the use of parpool('Processes'). With that kind of pool, if each iteration performs only a cheap operation, parfor can be much slower than for because of the cost of transferring data between processes. This can be demonstrated using the ticBytes(gcp) function.
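As a rough illustration of that measurement, here is a minimal sketch that wraps a cheap parfor reduction in ticBytes/tocBytes. It assumes the current pool (or the one it creates) is a process pool; the names n, A, total, and bytes and the size 1e6 are placeholders for illustration, not values from my earlier tests.

```matlab
% Sketch: measure how many bytes are sent to and received from the workers.
% Assumption: the pool in use is a process pool.
pool = gcp('nocreate');          % reuse an existing pool if there is one
if isempty(pool)
    pool = parpool('Processes'); % otherwise start a process pool
end

n = 1e6;                         % illustrative problem size
A = rand(n, 1);
total = 0;

ticBytes(pool);                  % start counting transferred bytes
parfor i = 1:n
    total = total + A(i);        % cheap reduction on a sliced variable
end
bytes = tocBytes(pool);          % one row per worker: [BytesSentToWorkers, BytesReceivedFromWorkers]
disp(bytes);
```

If the transferred bytes are large relative to the work done per iteration, the transfer cost is a plausible explanation for the slowdown.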
Based on the suggestions of @Cris Luengo and @Edric, I switched to parpool('Threads'). I expected parfor to be faster than for in large-scale cases. I tested this at various scales ranging from 1e3 to 1e9, and the code is shown below.
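len = 8e8;                 % number of elements (the scale)
A = rand(len, 1);
sum1 = 0;
sum2 = 0;

% Serial loop: accumulate the sum of A
fprintf("Using for loop: ");
tic
for i = 1:len
    sum1 = sum1 + A(i);
end
toc

% Parallel loop: sum2 is a reduction variable
fprintf("Using parfor loop: ");
tic
parfor i = 1:len
    sum2 = sum2 + A(i);
end
toc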
I have made some interesting observations and formulated the following statements:
When the scale is small, for example 1e3, parfor is slower than for, because some internal parallel scheduling is executed. This cost is larger than the cost of the "plus" operation.
When the scale is very large, for example 1e9, parfor is still much slower than for, because the RAM of my computer (16GB) is not sufficient for storing such a huge matrix and reading it from RAM. This can be demonstrated by the Windows Task Manager.
When the scale is appropriate, for example 6e8 or 7e8, parfor is faster than for. I hope this is not a statistical error.
The results are in: I averaged the timings over 10 runs at every scale. It seems that parfor is always slower than for. The figure below compares the time costs. The x-axis is the scale of the matrix, and the y-axis is the time cost.
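For reference, the averaging can be sketched roughly as follows. This is not my exact script: the list of scales is shortened so that it runs quickly, and the names scales, nRuns, tFor, and tParfor are illustrative. It assumes a thread pool is available and that no other pool type is already open.

```matlab
% Sketch: average for vs. parfor timings over several runs per scale.
% Assumption: a thread pool can be started; scales and nRuns are illustrative.
if isempty(gcp('nocreate'))
    parpool('Threads');
end

scales  = [1e3 1e4 1e5 1e6 1e7];   % shortened list of problem sizes
nRuns   = 10;                      % repetitions per scale
tFor    = zeros(size(scales));     % mean time of the for loop
tParfor = zeros(size(scales));     % mean time of the parfor loop

for k = 1:numel(scales)
    len = scales(k);
    A = rand(len, 1);
    for r = 1:nRuns
        s1 = 0;
        t = tic;
        for i = 1:len
            s1 = s1 + A(i);
        end
        tFor(k) = tFor(k) + toc(t) / nRuns;

        s2 = 0;
        t = tic;
        parfor i = 1:len
            s2 = s2 + A(i);        % reduction variable
        end
        tParfor(k) = tParfor(k) + toc(t) / nRuns;
    end
end

% Plot both curves on log-log axes
loglog(scales, tFor, '-o', scales, tParfor, '-s');
xlabel('Scale of matrix (number of elements)');
ylabel('Average time (s)');
legend('for', 'parfor', 'Location', 'northwest');
```

Averaging over repeated runs reduces the influence of one-off effects such as pool warm-up on the first parfor call, although it does not remove them entirely.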

