I am using OpenCV's DFT on mobile phones and tablets, i.e. ARM devices. The code is in C++. I expected to be able to beat its FFT performance by using ARM registers and fixed-point arithmetic, but my implementation takes twice as long as OpenCV's; I cannot even match its time.
I use a radix-4, 256-point FFT.
Does anybody know what OpenCV does and why it is so difficult to beat? Which is the fastest FFT algorithm for ARM devices: radix-4, radix-8, 256 points, 1024...?
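To make the setup concrete, here is a minimal sketch of the kind of fixed-point radix-4 butterfly I mean. It is not my exact code: the `CplxQ15` type, the helper names, the Q15 format and the divide-by-4 scaling per stage are all assumptions chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical Q15 complex type: 16-bit signed, value = raw / 32768.
struct CplxQ15 { int16_t re; int16_t im; };

// Clamp a 32-bit intermediate to the int16_t range.
static inline int16_t sat16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return static_cast<int16_t>(v);
}

// Q15 complex multiply with rounding, e.g. input sample times twiddle factor.
// Assumes >> on negative values is an arithmetic shift (true on ARM compilers
// in practice, guaranteed from C++20).
static inline CplxQ15 mulQ15(CplxQ15 a, CplxQ15 w) {
    int32_t re = int32_t(a.re) * w.re - int32_t(a.im) * w.im;
    int32_t im = int32_t(a.re) * w.im + int32_t(a.im) * w.re;
    return { sat16((re + (1 << 14)) >> 15), sat16((im + (1 << 14)) >> 15) };
}

// In-place forward radix-4 butterfly (inputs already multiplied by twiddles).
// Outputs are scaled by 1/4 so that repeated stages cannot overflow 16 bits.
static inline void radix4ButterflyQ15(CplxQ15& x0, CplxQ15& x1,
                                      CplxQ15& x2, CplxQ15& x3) {
    // Work in 32 bits: sums of four int16 values do not fit in 16 bits.
    int32_t t0r = x0.re + x2.re, t0i = x0.im + x2.im;  // x0 + x2
    int32_t t1r = x0.re - x2.re, t1i = x0.im - x2.im;  // x0 - x2
    int32_t t2r = x1.re + x3.re, t2i = x1.im + x3.im;  // x1 + x3
    int32_t t3r = x1.re - x3.re, t3i = x1.im - x3.im;  // x1 - x3

    x0 = { sat16((t0r + t2r) >> 2), sat16((t0i + t2i) >> 2) };  // t0 + t2
    x1 = { sat16((t1r + t3i) >> 2), sat16((t1i - t3r) >> 2) };  // t1 - j*t3
    x2 = { sat16((t0r - t2r) >> 2), sat16((t0i - t2i) >> 2) };  // t0 - t2
    x3 = { sat16((t1r - t3i) >> 2), sat16((t1i + t3r) >> 2) };  // t1 + j*t3
}
```

A 256-point transform runs four such stages (256 = 4^4), so with this scaling the result comes out divided by 256 overall.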
OpenCV's implementation uses device-specific optimizations on Tegra, Tegra 2, and Tegra 3 devices. On Tegra and Tegra 2 the implementation is parallelized and some operations use GLSL shaders to run on the GPU; on Tegra 3 it also uses NEON SIMD instructions to vectorize some operations on the CPU, and CUDA for even better GPU performance. Given that NVIDIA lent manpower to the optimization effort, drawing on their in-depth knowledge of the platform, outperforming it on anything more than the odd uncommon operation would probably be a big task.
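To give a feel for what NEON vectorization buys you in an FFT inner loop, here is a rough sketch of a twiddle-factor multiply that handles four complex floats per iteration. This is not OpenCV's code: the function name `complexMultiply` and the interleaved (re, im) layout are assumptions for illustration, and the scalar loop is used as a fallback when NEON is not available.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// out[k] = a[k] * w[k] for n complex floats stored interleaved as
// re0, im0, re1, im1, ... (hypothetical helper, not part of OpenCV).
void complexMultiply(const float* a, const float* w, float* out, std::size_t n) {
    assert(a != nullptr && w != nullptr && out != nullptr);
    std::size_t k = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Vector path: four complex values per iteration.
    for (; k + 4 <= n; k += 4) {
        // vld2q_f32 de-interleaves: val[0] holds 4 real parts, val[1] 4 imaginary parts.
        float32x4x2_t va = vld2q_f32(a + 2 * k);
        float32x4x2_t vw = vld2q_f32(w + 2 * k);
        float32x4x2_t vo;
        // re = ar*wr - ai*wi
        vo.val[0] = vmlsq_f32(vmulq_f32(va.val[0], vw.val[0]), va.val[1], vw.val[1]);
        // im = ar*wi + ai*wr
        vo.val[1] = vmlaq_f32(vmulq_f32(va.val[0], vw.val[1]), va.val[1], vw.val[0]);
        // vst2q_f32 re-interleaves on store.
        vst2q_f32(out + 2 * k, vo);
    }
#endif
    // Scalar path: the remaining elements, or everything when NEON is unavailable.
    for (; k < n; ++k) {
        const float ar = a[2 * k], ai = a[2 * k + 1];
        const float wr = w[2 * k], wi = w[2 * k + 1];
        out[2 * k]     = ar * wr - ai * wi;
        out[2 * k + 1] = ar * wi + ai * wr;
    }
}
```

The same idea carries over to fixed point with the 16-bit NEON lanes, where eight values fit in one register; a hand-written scalar fixed-point loop competes against code that processes several elements per instruction.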
This article is mostly Tegra 3 specific, but talks a lot about the kind of techniques they used and the performance speedup they got over optimized but device-independent code.