The title says it already. I am currently parallelizing my code and a major bottleneck is posed by element-wise multiplication of two three-dimensional ndarrays. My system monitor reveals that only one of the 40 available cores is used for that operation.
I know parallelization works, because the other scipy.fft and BLAS operations run in parallel.
So far, I have not really found any meaningful questions/issues on SO or GitHub. It is a bit bewildering that no one else has had this issue. Am I missing something?
I tried playing with BLAS environment variables and using dgbmv with flattened arrays to achieve the desired behaviour, but I have not been successful yet. A minimal code example would be (with much larger k, 3d arrays, and broadcasting involved in my case):
import numpy as np
k = 1_000_000  # must be an int: np.random.rand(1e6) raises a TypeError
x = np.random.rand(k)
y = np.random.rand(k)
z = np.multiply(x, y)
You can have a look at numexpr: https://pypi.org/project/numexpr/2.6.1/
This library is supposed to use all your cores.
You can use it like this:
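A minimal sketch of that usage, assuming numexpr is installed (`pip install numexpr`). The array shapes and the `multiply` helper are made up for illustration, and the helper falls back to plain NumPy if numexpr cannot be imported:

```python
import numpy as np

try:
    import numexpr as ne  # third-party package, not part of NumPy
except ImportError:
    ne = None  # fall back to single-threaded NumPy below


def multiply(x, y):
    """Element-wise product; multi-threaded when numexpr is available."""
    if ne is None:
        return np.multiply(x, y)
    # evaluate() compiles the expression string and splits the work
    # into chunks that are processed by several threads.
    return ne.evaluate("x * y", local_dict={"x": x, "y": y})


rng = np.random.default_rng(0)
x = rng.random((50, 60, 70))  # hypothetical 3d array
y = rng.random((60, 1))       # broadcast against x, as in the question
z = multiply(x, y)
```

On a machine with many cores it may be worth checking how many threads numexpr actually uses. `ne.set_num_threads(n)` changes that number and returns the previous setting.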