The following code simulates extracting binary words from different locations within a set of images.
The Numba-wrapped function, wordcalc in the code below, has two problems:
- It runs about 3× slower than a comparable C++ implementation.
- Most strangely, if you swap the order of the "ibase" and "ibit" for-loops, it slows down by a further factor of 10 (!). The C++ implementation is unaffected by the same change.
I'm using Numba 0.18.2 from WinPython 2.7.
What could be causing this?
import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = 10**4/4
bitsNum = 13

Xs = np.random.rand(numInsts, imDim**2)
iInstInds = np.array(range(numInsts)[::4])
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]),
               bitsNum, np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))
I get a 4× speed-up by changing from
np.uint16(u * (2**ibit))
to
np.uint16(u << ibit)
i.e. replacing the power of two with a bit shift, which is equivalent for non-negative integers. It seems reasonably likely that your C++ compiler is making this substitution itself.
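As a quick sanity check in plain Python (no Numba required), the two forms agree for the values that actually occur here, where u is a 0/1 comparison result and ibit runs over the 13 bit positions:

```python
# Check that u * (2**ibit) and (u << ibit) give identical results
# for integer u in {0, 1} and bit positions 0..12 (bitsNum = 13).
for u in (0, 1):
    for ibit in range(13):
        assert u * (2 ** ibit) == (u << ibit)

print("shift and power forms agree")
```

The shift form skips computing the power at runtime, which is exactly the strength-reduction an optimizing C++ compiler would normally apply for you.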
Swapping the order of the two loops makes only a small difference for me, for both your original version (5%) and my optimized version (15%), so I don't think I can make a useful comment on that.
If you really want to compare the Numba and C++ output, you can inspect the assembly Numba generates by setting
os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'
before you import Numba. (That's clearly quite involved, though.) For reference, I'm using Numba 0.19.1.
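For completeness, a minimal sketch of that environment-variable approach; the key point is ordering, since the flag is read when Numba is first imported:

```python
import os

# NUMBA_DUMP_ASSEMBLY is read once, when Numba is first imported, so it
# must be set before any `import numba` statement runs in the process.
os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'

# import numba as nb  # importing after this point makes Numba print the
#                     # generated assembly for every JIT-compiled function
print(os.environ['NUMBA_DUMP_ASSEMBLY'])  # -> 1
```

Alternatively, the variable can be set in the shell before launching Python, which avoids ordering concerns entirely.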