CUFFT is 1000x slower in VS2013/Cuda7.0 compared to VS2010/Cuda4.2

Question

Asked by The Vivandiere on 13 September 2024 at 16:17

I ran this simple CUFFT code in two IDE/toolkit combinations:

VS 2013 with Cuda 7.0
VS 2010 with Cuda 4.2

I found that VS 2013 with Cuda 7.0 was approximately 1000 times slower. On average, the code executed in 0.6 ms with VS 2010 and took 520 ms with VS 2013.

#include "stdafx.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cufft.h"
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds;

    return 0;
}

The code was run on the same computer, with the same OS and the same graphics card, one build immediately after the other. The configuration in both cases was x64 Release. The file can be compiled with either the C++ compiler or CUDA C/C++; I tried both options in both projects, and it made no difference.

Any ideas to fix this?

FWIW, I get the same results with Cuda 6.5 on VS 2013 as with Cuda 7.0.

There is 1 answer

**Robert Crovella** · Accepted Answer · 2015-06-23 20:54:56

The cufft library has grown considerably from CUDA 4.2 to CUDA 7.0, which results in substantially more initialization time. If you remove this initialization time as a factor, I think you will find far less than a 1000x difference in execution time.

Here's a modified version of the code demonstrating this:

$ cat t807.cu
#include <cufft.h>
#include <cuComplex.h>
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    cudaEventRecord(start);
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    return 0;
}
$ nvcc -o t807 t807.cu -lcufft
$ ./t807
94.8298
1.44778
$

The second number above represents essentially the same code with the cufft initialization removed (since it was done on the first pass).
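
The same measurement idea (pay the one-time setup cost in an untimed warm-up call, then average several timed calls) can be sketched in host-only C++. This is only an illustration under stated assumptions: the helper name `time_after_warmup` is hypothetical and not part of CUDA or cufft, and it uses a wall-clock timer rather than CUDA events.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not part of CUDA or cufft): calls `work` once untimed
// so that one-time setup cost is paid up front, then returns the mean
// wall-clock time in milliseconds of `runs` further calls.
// Assumption: for GPU work, `work` ends with a synchronization such as
// cudaDeviceSynchronize(); otherwise asynchronous launches may still be
// running when the timer stops.
double time_after_warmup(const std::function<void()>& work, int runs)
{
    assert(runs > 0);
    work();  // warm-up call, deliberately excluded from the measurement

    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
    {
        work();
    }
    const auto end = std::chrono::steady_clock::now();

    // Convert the elapsed interval to fractional milliseconds.
    const std::chrono::duration<double, std::milli> elapsed = end - begin;
    return elapsed.count() / runs;
}
```

For the code above, `work` would wrap the plan, execute, and copy calls followed by a synchronization; the returned value should then be comparable to the second number printed by t807 rather than the first.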
