FFT using C++ fixed-point for optimizing performance for ARM devices


I am using OpenCV's DFT on phones and tablets, i.e. ARM devices. The code is in C++. I expected to be able to beat its FFT performance by using ARM registers and fixed-point arithmetic, but my implementation takes twice as long as OpenCV's; I cannot even match it.

I use a radix-4, 256-point FFT.
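For readers who want something concrete to compare against, here is a minimal sketch of a radix-4 fixed-point FFT of the kind described. It is neither the asker's code nor OpenCV's implementation. The names `CplxQ15`, `sat16`, `mulQ15` and `fftRadix4Q15` are hypothetical, and the Q15 format, the divide-by-4 scaling per stage and the decimation-in-frequency layout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Complex sample in Q15: each part is an int16_t, real value = raw / 32768.
struct CplxQ15 { int16_t re, im; };

// Clamp a 32-bit intermediate to the int16_t range.
static inline int16_t sat16(int32_t v) {
    return static_cast<int16_t>(v > 32767 ? 32767 : (v < -32768 ? -32768 : v));
}

// Multiply (ar + i*ai) by the Q15 twiddle w, rounding back to Q15.
// Assumes arithmetic right shift of negative values (guaranteed from C++20).
static inline CplxQ15 mulQ15(int32_t ar, int32_t ai, CplxQ15 w) {
    int32_t re = (ar * w.re - ai * w.im + (1 << 14)) >> 15;
    int32_t im = (ar * w.im + ai * w.re + (1 << 14)) >> 15;
    return { sat16(re), sat16(im) };
}

// In-place forward FFT for N = 4^k (e.g. 256). Every stage divides by 4,
// so the result is DFT(x) / N and intermediate sums fit in 16 bits.
void fftRadix4Q15(std::vector<CplxQ15>& x) {
    const std::size_t N = x.size();
    std::size_t p = 1;
    while (p < N) p *= 4;
    assert(p == N && "length must be a power of 4");

    // Twiddle table W_N^k = exp(-2*pi*i*k/N), scaled by 32767 so that 1.0 fits.
    // Built in floating point here; a real implementation would precompute it.
    const double kPi = 3.14159265358979323846;
    std::vector<CplxQ15> w(N);
    for (std::size_t k = 0; k < N; ++k) {
        const double a = -2.0 * kPi * static_cast<double>(k) / static_cast<double>(N);
        w[k] = { static_cast<int16_t>(std::lround(32767.0 * std::cos(a))),
                 static_cast<int16_t>(std::lround(32767.0 * std::sin(a))) };
    }

    // Decimation-in-frequency stages on blocks of size n = N, N/4, ..., 4.
    for (std::size_t n = N; n >= 4; n /= 4) {
        const std::size_t q = n / 4, stride = N / n;
        for (std::size_t b = 0; b < N; b += n) {
            for (std::size_t j = 0; j < q; ++j) {
                const CplxQ15 a0 = x[b + j], a1 = x[b + j + q];
                const CplxQ15 a2 = x[b + j + 2 * q], a3 = x[b + j + 3 * q];
                const int32_t t0r = a0.re + a2.re, t0i = a0.im + a2.im;
                const int32_t t1r = a0.re - a2.re, t1i = a0.im - a2.im;
                const int32_t t2r = a1.re + a3.re, t2i = a1.im + a3.im;
                // t3 = -i * (a1 - a3), i.e. (re, im) -> (im, -re)
                const int32_t t3r = a1.im - a3.im, t3i = a3.re - a1.re;
                x[b + j] = { sat16((t0r + t2r) >> 2), sat16((t0i + t2i) >> 2) };
                x[b + j + q] = mulQ15((t1r + t3r) >> 2, (t1i + t3i) >> 2, w[j * stride]);
                x[b + j + 2 * q] = mulQ15((t0r - t2r) >> 2, (t0i - t2i) >> 2, w[2 * j * stride]);
                x[b + j + 3 * q] = mulQ15((t1r - t3r) >> 2, (t1i - t3i) >> 2, w[3 * j * stride]);
            }
        }
    }

    // Undo the base-4 digit-reversed output order.
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t r = 0, t = i;
        for (std::size_t m = 1; m < N; m *= 4) { r = r * 4 + (t & 3); t >>= 2; }
        if (r > i) std::swap(x[i], x[r]);
    }
}
```

Calling `fftRadix4Q15` on a 256-element vector transforms it in place. Scalar code like this is portable but leaves most of the ARM core's throughput unused, which is probably part of why a hand-written version can lose to a tuned library.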

Does anybody know what OpenCV does, and why it is so difficult to beat? Which is the fastest FFT algorithm for ARM devices: radix-4, radix-8, 256 points, 1024 points...?


There is 1 answer

Dan Hulme (best answer)

OpenCV's implementation uses device-specific optimizations on Tegra, Tegra 2, and Tegra 3 devices. On Tegra and Tegra 2 the implementation is parallelized, and some operations use GLSL shaders to run on the GPU; on Tegra 3 it also uses NEON SIMD instructions to vectorize some operations on the CPU, and CUDA for even better GPU performance. Given that NVIDIA lent manpower to the optimization effort, bringing their in-depth knowledge of the platform, outperforming it for more than the odd uncommon operation would probably be a big task.
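To give a feel for what NEON vectorization of fixed-point arithmetic looks like, here is a small sketch that multiplies two Q15 arrays eight lanes at a time. It is not taken from OpenCV; the function names `mulQ15Scalar` and `mulQ15Array` are hypothetical, and the choice of Q15 is an assumption. The intrinsics `vld1q_s16`, `vqrdmulhq_s16` and `vst1q_s16` come from `<arm_neon.h>`; on non-NEON targets the code falls back to a scalar loop with the same rounding and saturation.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Scalar reference: rounding Q15 multiply, sat((a*b + 2^14) >> 15).
// Only (-1.0) * (-1.0) can overflow, so only the upper bound needs clamping.
static inline int16_t mulQ15Scalar(int16_t a, int16_t b) {
    const int32_t p = (static_cast<int32_t>(a) * b + (1 << 14)) >> 15;
    return static_cast<int16_t>(p > 32767 ? 32767 : p);
}

// Element-wise Q15 multiply of two arrays of length n.
void mulQ15Array(const int16_t* a, const int16_t* b, int16_t* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Eight 16-bit lanes per iteration; vqrdmulhq_s16 is a saturating,
    // rounding, doubling multiply that keeps the high half, i.e. a Q15 multiply.
    for (; i + 8 <= n; i += 8) {
        const int16x8_t va = vld1q_s16(a + i);
        const int16x8_t vb = vld1q_s16(b + i);
        vst1q_s16(out + i, vqrdmulhq_s16(va, vb));
    }
#endif
    // Scalar tail (or the whole array when NEON is unavailable).
    for (; i < n; ++i) out[i] = mulQ15Scalar(a[i], b[i]);
}
```

An FFT butterfly needs more than a plain multiply, but the same pattern applies: load several samples per register, operate on all lanes at once, and store.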

This article is mostly Tegra 3 specific, but talks a lot about the kind of techniques they used and the performance speedup they got over optimized but device-independent code.