The following code simulates extracting binary words from different locations within a set of images.
The Numba-wrapped function, wordcalc in the code below, has two problems:
- It runs about 3× slower than a comparable C++ implementation.
- Most strangely, if you swap the order of the "ibase" and "ibit" for-loops, it slows down by a further factor of 10 (!). The C++ implementation is unaffected by the same change.
I'm using Numba 0.18.2 from WinPython 2.7.
What could be causing this?
import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = 10**4/4
bitsNum = 13

Xs = np.random.rand(numInsts, imDim**2)
iInstInds = np.array(range(numInsts)[::4])
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]),
               bitsNum, np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))
I get a 4× speed-up by changing from
np.uint16(u * (2**ibit))
to
np.uint16(u << ibit)
i.e. replacing the power of two with a bit shift, which is equivalent for non-negative integers. It seems reasonably likely that your C++ compiler is making this substitution itself.
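As a quick sanity check in plain Python (no Numba required), the two forms agree for the values that actually occur here, where u is a 0/1 comparison result and ibit runs over the 13 bit positions:

```python
# Check that u * (2**ibit) and (u << ibit) give identical results
# for integer u in {0, 1} and bit positions 0..12 (bitsNum = 13).
for u in (0, 1):
    for ibit in range(13):
        assert u * (2 ** ibit) == (u << ibit)

print("shift and power forms agree")
```

The shift form skips computing the power at runtime, which is exactly the strength-reduction an optimizing C++ compiler would normally apply for you.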
Swapping the order of the two loops makes only a small difference for me, for both your original version (5%) and my optimized version (15%), so I don't think I can make a useful comment on that.
If you really want to compare the Numba and C++ output, you can inspect the assembly Numba generates by setting
os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'
before you import Numba. (That's clearly quite involved, though.) For reference, I'm using Numba 0.19.1.
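For completeness, a minimal sketch of that environment-variable approach; the key point is ordering, since the flag is read when Numba is first imported:

```python
import os

# NUMBA_DUMP_ASSEMBLY is read once, when Numba is first imported, so it
# must be set before any `import numba` statement runs in the process.
os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'

# import numba as nb  # importing after this point makes Numba print the
#                     # generated assembly for every JIT-compiled function
print(os.environ['NUMBA_DUMP_ASSEMBLY'])  # -> 1
```

Alternatively, the variable can be set in the shell before launching Python, which avoids ordering concerns entirely.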