This Julia function seems to be quite inefficient (an order of magnitude slower than the equivalent Pythran / C++ code, even after the Julia warmup)...
function my_multi_broadcast(a)
10 * (2*a.^2 + 4*a.^3) + 2 ./ a
end
arr = ones(1000, 1000)
my_multi_broadcast(arr)
I guess it is only that I don't write it correctly... How can one speedup such "multi broadcasts" in Julia? I guess/hope I don't need to expend the loops...
Edit after the first answer
Thank you! With my setup, the Pythran solutions (in place and out of place) are still 1.5 to 2 times faster (without OpenMP). Is there a way to activate SIMD instructions in Julia? Or another way to speed up such CPU computations?
The Python code:
from transonic import jit
@jit
def broadcast(a):
return 10 * (2*a**2 + 4*a**3) + 2 / a
@jit
def broadcast_inplace(a):
a[:] = 10 * (2*a**2 + 4*a**3) + 2 / a
Edit after the @simd suggestion
It seems that @simd does not work out of the box, i.e. just by adding it at the beginning of the line.
ERROR: LoadError: LoadError: Base.SimdLoop.SimdError("for loop expected")
Stacktrace:
[1] compile(::Expr, ::Bool) at ./simdloop.jl:54
[2] @simd(::LineNumberNode, ::Module, ::Any) at ./simdloop.jl:126
[3] include at ./boot.jl:317 [inlined]
[4] include_relative(::Module, ::String) at ./loading.jl:1044
[5] include(::Module, ::String) at ./sysimg.jl:29
[6] exec_options(::Base.JLOptions) at ./client.jl:231
[7] _start() at ./client.jl:425
I guess that one would have to expand the for loops, but then the code (i) becomes much less readable and (ii) is no longer independent of the dimension.
It seems that we have a case for which simple Python/Numpy code can get accelerated with Pythran faster than what we get with Julia (except if there is a way to accelerate this in Julia? and a future Julia version may solve this). Interesting...
Broadcast all operations like this:
The difference is that in
10 * (2*a.^2 + 4*a.^3) + 2 ./ ayou actually do not take advantage of broadcast fusion as*and two+are not broadcasted.Writing
@. 10 * (2*a^2 + 4*a^3) + 2 / ais equivalent to10 .* (2 .* a.^2 .+ 4 .* a.^3) .+ 2 ./ a.And here is the comparison of performance
How does it compare to Pythran / C++, as we get roughly 10x speedup?
Finally note that if you could mutate
arrin place by writing:which is yet faster and does zero allocations (I do not know if you want to modify
arrin place or create a new array so I show both approaches).