I am trying to optimize the QTC video codec to run on the Raspberry Pi with decent performance. One important bottleneck is a 32-bit integer division in the range decoder, which accounts for 18% of the decoding time. Since the device's ARM processor apparently lacks a hardware integer division instruction, this looks like a promising target for optimization. The division has to be exact.
Both the dividend and the divisor change on every call, but the divisor is known to always be smaller than 65536. I thought about building a lookup table of fixed-point reciprocals of all possible divisors, so that each division becomes a multiplication. With one 32-bit entry per divisor, the lookup table would be 256 kibibytes.
Questions
- Is it a good idea to perform that optimization?
- Are there better ways to get rid of the software division?
- Is there a different way to implement the algorithm such that there is no division?
- Other ideas?
If you want to use a magic multiplication + LUT, here's some code.
Simple tester testing random divisors i. I did not exhaustively test all i's, but it has worked for the short period I ran it; it seems to give correct results for all 32-bit dividends (j = 0..2^32-1) for the i's I tested.
In reality, you would precompute a lookup table for i = 2..65535 or some range like that (i = 0 won't work because value/0 is undefined, and i = 1 won't work because its magic multiplier is just outside the range of a 32-bit number). Then use the equation with i as your lookup index to get the magic multiplier 'm'. Change as needed & don't hate on my style. :P