i'm trying to port some code from CPU to GPU that includes some FFTs. So, on CPU code some complex array is transformed using fftw_plan_many_r2r
for both real and imag parts of it separately. Function foo represents R2R transform routine and called twice for each part of complex array.
void foo(vector_double &evg) {
int nx = Dims[0], ny = Dims[1], nz = Dims[2];
const int nxny[] = {ny, nx};
const int n = nx*ny*nz;
const fftw_r2r_kind kinds[] = {FFTW_RODFT00, FFTW_RODFT00};
if (evg.size() != n)
throw std::runtime_error ("*** weird size of evg");
fftw_plan p;
p = fftw_plan_many_r2r(2, nxny, nz,
&evg[0], nxny, 1, nx*ny,
&evg[0], nxny, 1, nx*ny,
kinds, FFTW_ESTIMATE);
// actual FFT
fftw_execute(p);
}
void bar(vector_complex &evg) {
vector_double tmp;
tmp = evg.real();
foo(tmp);
evg.real() = tmp;
tmp = evg.imag();
foo(tmp);
evg.imag() = tmp;
}
So, how can i receive the same results on CUDA since there is no straight conversion from FFTW R2R to cuFFT? P.S. vector_double and vector_complex are Eigen vectors if that helps
I cannot provide the solution, but comments have limit on size, so I'm putting this here:
With FFTW you use inplace transform, but you're not using FFTW_IN_PLACE. I don't know if that's correct, never used inplace transform by myself.
Indeed cuFFT doesn't have R2R, so we have to investigate. According to fftw docs, FFTW_RODFT00 means DST-I. According to wikipedia, DST-I is sine transform, which has equvalent fourier transform if you make a vector of size 2*(N+1) and copy values in reverse, as in the picture on the right marked as DST-I: https://en.wikipedia.org/wiki/Discrete_sine_transform . So if you do a r2c (or c2c) transform of that "extended vector" and drop some values from the transform vector, you'll have exactly R2R transform. See https://en.wikipedia.org/wiki/Discrete_sine_transform#DST-I: "A DST-I is exactly equivalent to a DFT of a real sequence that is odd around the zero-th and middle points, scaled by 1/2".
There are two problems though:
You have to deduce which indices have to be dropped (that is, copied to the resulting vector) by yourself, it's kinda complicated to approach it on the spot.
If you employ the c2r case with additional copying, the GPU has to make a lot more computation than fftw does in r2r case (
2(N+1)
-size transform instead of justN
), and more memory allocations must be done, so it won't be as fast as with r2c or c2c cases. But that according to my experience even older mainstream GPUs are a lot faster than CPUs with FFT (say an order of magnitude), so you may get at least some speedup.