I'm working on the render part of Assignment 2 of CMU's 15-418 course, which involves writing a high-performance circle renderer in CUDA. In my code, each CUDA thread is responsible for computing a single pixel, which is a data-parallel task.
To accelerate the computation, I stage the per-circle data in shared memory:
__shared__ float smem_rads[32];
__shared__ float smem_colors[32 * 3];
__shared__ float smem_ps[32 * 3];

// Refill the shared-memory staging arrays every 32 circles.
for (int index = 0; index < cuConstRendererParams.numCircles; ++index) {
    int index3 = 3 * index;
    if (index % 32 == 0) {
        __syncthreads();
        if (threadId < 32) {  // threadId: linearized thread index within the block
            // Copy the next 32 circles' positions (3 floats each),
            // radii, and colors from global to shared memory.
            smem_ps[threadId]      = cuConstRendererParams.position[index3 + threadId];
            smem_ps[threadId + 32] = cuConstRendererParams.position[index3 + threadId + 32];
            smem_ps[threadId + 64] = cuConstRendererParams.position[index3 + threadId + 64];
            smem_rads[threadId]    = cuConstRendererParams.radius[index + threadId];
            smem_colors[threadId]      = cuConstRendererParams.color[index3 + threadId];
            smem_colors[threadId + 32] = cuConstRendererParams.color[index3 + threadId + 32];
            smem_colors[threadId + 64] = cuConstRendererParams.color[index3 + threadId + 64];
        }
        __syncthreads();
    }
    // Read the current circle's values back from shared memory.
    float3 p   = ((float3*)smem_ps)[index % 32];
    float  rad = smem_rads[index % 32];
    float3 rgb = ((float3*)smem_colors)[index % 32];  // was "float rgb": type fixed
    // ... per-pixel coverage test and blending using p, rad, rgb ...
}
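For context, the loop body consumes these values roughly as follows. This is a simplified sketch: the blending itself happens in the assignment's `shadePixel`-style helper, and the exact variable names (`pixelCenterX`, `pixelCenterY`) are placeholders for what my kernel computes from the thread's pixel coordinates.

```
// Sketch of how each thread (one pixel) uses the cached circle data.
// pixelCenterX/pixelCenterY are this thread's normalized pixel center
// (names are placeholders, not from the snippet above).
float diffX = p.x - pixelCenterX;
float diffY = p.y - pixelCenterY;
if (diffX * diffX + diffY * diffY <= rad * rad) {
    // Pixel is covered by this circle: blend rgb into the pixel's
    // accumulated color (done by the starter code's shading helper).
}
```

Note that within a warp, every thread reads the same `index % 32` slot in the same iteration, so I expected broadcasts rather than conflicts.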
However, when I profile the code with Nsight Compute, it reports shared-memory bank conflicts.

Could someone please explain why these bank conflicts occur and suggest ways to resolve them? I'd greatly appreciate any insights or recommendations for optimizing my shared-memory access patterns.
Other Relevant Information:
- CUDA version: 12.2
- GPU model: NVIDIA RTX 3090
I have tried changing 32 to other values, but it did not help.
EDIT: Thank you for all your assistance. I switched to __ldg() for loading the read-only global memory, and the bank conflicts disappeared. However, it appears that bank conflicts were not the primary factor limiting performance.
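For anyone curious, the change was essentially to drop the shared-memory staging and load each circle's data directly through the read-only data cache. A rough sketch (assuming the cuConstRendererParams arrays are plain float* in global memory, as in my original code; the field names are from the snippet above):

```
// Load circle data via the read-only data cache instead of shared
// memory. __ldg takes a const pointer, so the array pointers may
// need a const cast depending on how the struct is declared.
int index3 = 3 * index;
float3 p = make_float3(__ldg(&cuConstRendererParams.position[index3]),
                       __ldg(&cuConstRendererParams.position[index3 + 1]),
                       __ldg(&cuConstRendererParams.position[index3 + 2]));
float rad = __ldg(&cuConstRendererParams.radius[index]);
float3 rgb = make_float3(__ldg(&cuConstRendererParams.color[index3]),
                         __ldg(&cuConstRendererParams.color[index3 + 1]),
                         __ldg(&cuConstRendererParams.color[index3 + 2]));
```

Since all threads in the block read the same circle in the same iteration, these loads hit the cache after the first warp, which is presumably why shared memory was not buying much here.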