numpy ufuncs speed vs for loop speed

9k views Asked by At

I've read a lot "avoid for loops with numpy". So, I tried. I was using this code (simplified version). Some auxiliary data:

 In[1]: import numpy as np
        resolution = 1000                             # this parameter varies
        tim = np.linspace(-np.pi, np.pi, resolution) 
        prec = np.arange(1, resolution + 1)
        prec = 2 * prec - 1
        values = np.zeros_like(tim)

My first implementation was with for loop:

 In[2]: for i, ti in enumerate(tim):
           values[i] = np.sum(np.sin(prec * ti))

Then, I got rid of the explicit for cycle, and achieved this:

 In[3]: values = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)

And this solution was faster for small arrays, but when I scaled up I got such time dependence: enter image description here

What I'm missing or is it normal behavior? And if it is not, where to dig?

EDIT: According to the comments, here is some additional information. The times were measured with IPython's %timeit and %%timeit, every run was performed on the fresh kernel. My laptop is acer aspire v7-482pg (i7, 8GB). I'm using:

  • python 3.5.2
  • numpy 1.11.2 + mkl
  • Windows 10
1

There are 1 answers

7
MSeifert On BEST ANSWER

This is normal and expected behaviour. It's just too simplified to apply the "avoid for loops with numpy" statement everywere. If you're dealing with inner loops it's (almost) always true. But in case of outer loops (like in your case) there are far more exceptions. Especially if the alternative is to use broadcasting because this speeds up your operation by using a lot more memory.

Just to add a bit of background to that "avoid for loops with numpy" statement:

NumPy arrays are stored as contiguous arrays with types. The Python int is not the same as a C int! So whenever you iterate over each item in an array you need to plug the item from the array, convert it to a Python int and then do whatever you want to do with it and finally you may need to convert it to a c integer again (called boxing and unboxing the value). For example you want to sum the items in an array using Python:

import numpy as np
arr = np.arange(1000)
%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop

You better use numpy:

%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop

Even if you push the loop into Python C code you're far away from the numpy performance:

%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop

There might be exceptions from this rule but these will be really sparse. At least as long as there is some equivalent numpy functionality. So if you would iterate over single elements then you should use numpy.


Sometimes a plain python loop is enough. It's not widly advertised but numpy functions have a huge overhead compared to Python functions. For example consider a 3-element array:

arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)

Which one will be faster?

Solution: The Python function performs better than the numpy solution:

# 10000 loops, best of 3: 21.9 µs per loop  <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python

But what does this have to do with your example? Not all that much in fact because you always use numpy-functions on arrays (not single elements and not even few elements) so your inner loop already uses the optimized functions. That's why both perform roughly the same (+/- a factor 10 with very few elements to a factor 2 at roughly 500 elements). But that's not really the loop overhead, it's the function call overhead!

Your loop solution

Using line-profiler and a resolution = 100:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          752      7.4      5.7      for i, ti in enumerate(tim):
     3       100        12449    124.5     94.3          values[i] = np.sum(np.sin(prec * ti))

95% is spent inside the loop, I've even split the loop body into several parts to verify this:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          609      6.0      3.5      for i, ti in enumerate(tim):
     3       100         4521     45.2     26.3          x = prec * ti
     4       100         4646     46.5     27.0          x = np.sin(x)
     5       100         6731     67.3     39.1          x = np.sum(x)
     6       100          714      7.1      4.1          values[i] = x

The time consumers are np.multiply, np.sin, np.sum here, as you can easily check by comparing their time per call with their overhead:

arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop

So as soon as the comulative function call overhead is small compared to the calculation runtime you'll have similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break-even. With 1000 items the call-overhead is still significant:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      1001         5864      5.9      2.4      for i, ti in enumerate(tim):
     3      1000        42817     42.8     17.2          x = prec * ti
     4      1000       119327    119.3     48.0          x = np.sin(x)
     5      1000        73313     73.3     29.5          x = np.sum(x)
     6      1000         7287      7.3      2.9          values[i] = x

But with resolution = 5000 the overhead is quite low compared to the runtime:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      5001        29412      5.9      0.9      for i, ti in enumerate(tim):
     3      5000       388827     77.8     11.6          x = prec * ti
     4      5000      2442460    488.5     73.2          x = np.sin(x)
     5      5000       441337     88.3     13.2          x = np.sum(x)
     6      5000        36187      7.2      1.1          values[i] = x

When you spent 500us in each np.sin call you don't care about the 20us overhead anymore.

A word of caution is probably in order: line_profiler probably includes some additional overhead per line, maybe also per function call, so the point at which the function call overhead becomes negligable might be lower!!!

Your broadcast solution

I started by profiling the first solution, let's do the same with the second solution:

def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x

Again using the line_profiler with resolution=100:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           27     27.0      0.5      x = tim[:, np.newaxis]
     3         1          638    638.0     12.9      x = x * prec
     4         1         3963   3963.0     79.9      x = np.sin(x)
     5         1          326    326.0      6.6      x = np.sum(x, axis=1)
     6         1            4      4.0      0.1      return x

This already exceeds the overhead time significantly and thus we end up a factor 10 faster compared to the loop.

I also did the profiling for resolution=1000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           28     28.0      0.0      x = tim[:, np.newaxis]
     3         1        17716  17716.0     14.6      x = x * prec
     4         1        91174  91174.0     75.3      x = np.sin(x)
     5         1        12140  12140.0     10.0      x = np.sum(x, axis=1)
     6         1           10     10.0      0.0      return x

and with precision=5000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           34     34.0      0.0      x = tim[:, np.newaxis]
     3         1       333685 333685.0     11.1      x = x * prec
     4         1      2391812 2391812.0    79.6      x = np.sin(x)
     5         1       280832 280832.0      9.3      x = np.sum(x, axis=1)
     6         1           14     14.0      0.0      return x

The 1000 size is still faster, but as we've seen there the call overhead was still not-negligable in the loop-solution. But for resolution = 5000 the time spent in each step is almost identical (some are a bit slower, others faster but overall quite similar)

Another effect is that the actual broadcasting when you do the multiplication becomes significant. Even with the very smart numpy solutions this still includes some additional calculations. For resolution=10000 you see that the broadcasting multiplication starts to take up more "% time" in relation to the loop solution:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def broadcast_solution(tim, prec, values):
     2         1           37     37.0      0.0      x = tim[:, np.newaxis]
     3         1      1783345 1783345.0    13.9      x = x * prec
     4         1      9879333 9879333.0    77.1      x = np.sin(x)
     5         1      1153789 1153789.0     9.0      x = np.sum(x, axis=1)
     6         1           11     11.0      0.0      return x


Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def loop_solution(tim, prec, values):
     9     10001        62502      6.2      0.5      for i, ti in enumerate(tim):
    10     10000      1287698    128.8     10.5          x = prec * ti
    11     10000      9758633    975.9     79.7          x = np.sin(x)
    12     10000      1058995    105.9      8.6          x = np.sum(x)
    13     10000        75760      7.6      0.6          values[i] = x

But there is another thing besides actual time spent: the memory consumption. Your loop solution requires O(n) memory because you always process n elements. The broadcasting solution however requires O(n*n) memory. You probably have to wait some time if you use resolution=20000 with your loop but it would still only require 8bytes/element * 20000 element ~= 160kB but with the broadcasting you'll need ~3GB. And this neglects the constant factor (like temporary arrays aka intermediate arrays)! Suppose you go even further you'll run out of memory very fast!


Time to summarize the points again:

  • If you do a python loop over single items in a numpy array you're doing it wrong.
  • If you loop over subarrays of a numpy-array make sure the function call overhead in each loop is negligable compared to the time spent in the function.
  • If you broadcast numpy arrays make sure you don't run out of memory.

But the most important point about optimization is still:

  • Only optimize the code if it's too slow! If it's too slow then optimize only after profiling your code.

  • Don't blindly trust simplified statements and again never optimize without profiling.


One final thought:

Such functions that either require a loop or broadcasting can be easily implemented using , or if there is no already existing solution in or .

For example a numba function that combines the memory efficiency from the loop solution with the speed of the broadcast solution at low resolutions would look like this:

from numba import njit

import math

@njit
def numba_solution(tim, prec, values):
    size = tim.size
    for i in range(size):
        ti = tim[i]
        x = 0
        for j in range(size):
            x += math.sin(prec[j] * ti)
        values[i] = x

As pointed out in the comments numexpr can also evaluate the broadcasted calculation very fast and without requiring O(n*n) memory:

>>> import numexpr
>>> tim_2d = tim[:, np.newaxis]
>>> numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)')