Why isn't NVRTC optimizing out my integer division and modulo operations?

Question

Why isn't NVRTC optimizing out my integer division and modulo operations?

528 views Asked by Kh40tiK At 01 June 2017 at 06:44

I compiled a kernel in NVRTC:

__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}

I know integer division and modulo are very costly on CUDA GPUs. However I thought this kind of division-by-power-of-2 should be optimized into bit operations, until I found it isn't:

__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}

it seems kernel_B just runs faster. When omitting all other codes in kernel, launching with 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2us in average, while kernel_B runs 7.4us in average. I speculate NVRTC did not optimize out the integer division and modulo.

The result is obtained on a GeForce 750 Ti, CUDA 8.0, averaged from 100 calls. The compiler options given to nvrtcCompileProgram() is -arch compute_50.

Is this expected?

Original Q&A

There are 1 answers

**Kh40tiK** · Accepted Answer · 2017-06-10T13:37:20+00:00

Did a thorough bugsweep in the codebase. Turns out my app was built in DEBUG mode. This causes additional flags -G and -lineinfo passed to nvrtcCompileProgram()

From nvcc man page:

--device-debug (-G)

Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.

TechQA.

Why isn't NVRTC optimizing out my integer division and modulo operations?

There are 1 answers

Related Questions in OPTIMIZATION

Related Questions in CUDA

Related Questions in COMPILER-FLAGS

Related Questions in NVRTC

Popular Questions

Trending Questions