Ping-pong propagation in a GLSL compute shader: possible in one call?

I am trying to implement a propagation scheme for a 32x32x32 3D texture with a GLSL compute shader. It would be very nice if I could do x iterations with just one execution (one dispatch) of the shader.

I have three textures: one is the source, one is the target, and the third accumulates everything. The source and target have to be swapped after each iteration. In pseudocode, it would look like this.

OpenGL:

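glUseProgram(computeShaderId);
// Bind the three 3D textures to image units 0..2 (layered = GL_TRUE binds the whole volume)
glBindImageTexture(0, srcTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glBindImageTexture(1, targetTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glBindImageTexture(2, accumulateTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
// One work group per voxel, since the shader's local size is 1x1x1
glDispatchCompute(32, 32, 32);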

GLSL:

#version 430
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(rgba32f, binding = 0) uniform image3D srcTex;
layout(rgba32f, binding = 1) uniform image3D targetTex;
layout(rgba32f, binding = 2) uniform image3D accumulateTex;

void main() {
  ivec3 currentPos = ivec3(gl_GlobalInvocationID.xyz);

  for (int i = 0; i < 8; i++) {
    // Sum the values of the 6 neighbours (top, bottom, left, right, front, back)
    // read from the current source texture; this involves imageLoad() calls
    vec4 neighbourValues = getValuesFrom6Neighbours(currentPos, currentSource);

    imageStore(currentTarget, currentPos, neighbourValues);

    vec4 value = imageLoad(accumulateTex, currentPos);
    imageStore(accumulateTex, currentPos, neighbourValues + value);

    // The textures are swapped; I have a solution for this, so no problem here
    swapSrcAndTarget();

    // Here is the problem: how do I synchronize all the different shader invocations?
    someKindOfBarrier();
  }
}

The problem is that I cannot do all of this in one work group because of the size of the texture (32x32x32 = 32768 invocations, while OpenGL 4.3 only guarantees 1024 invocations per work group). If it all fit in one work group, I could just use barrier() and it would be fine. Because the textures are swapped, I need all values to be updated before they are read again in the next iteration. Does anyone have an idea whether this is somehow possible?

Thank you, Marc
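For reference, the loop the shader is meant to run can be written as a sequential C program in which each pass finishes completely before the swap. This is only a sketch under simplifying assumptions: one float per voxel instead of an RGBA32F texel, voxels outside the 32x32x32 volume read as zero, and `idx`, `at`, `propagate_pass` and `run` are hypothetical helper names, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

#define N 32          /* volume edge length, as in the question */
#define ITERATIONS 8  /* number of ping-pong passes, as in the shader loop */

/* Flattens (x, y, z) into a linear index; coordinates must be inside the volume. */
static size_t idx(int x, int y, int z) {
    assert(x >= 0 && y >= 0 && z >= 0 && x < N && y < N && z < N);
    return ((size_t)z * N + (size_t)y) * N + (size_t)x;
}

/* Reads one voxel; anything outside the volume counts as zero (assumption). */
static float at(const float *v, int x, int y, int z) {
    if (x < 0 || y < 0 || z < 0 || x >= N || y >= N || z >= N) return 0.0f;
    return v[idx(x, y, z)];
}

/* One complete pass: every voxel of dst is computed from src only, and the
   result is added into accum. On the GPU this is one full dispatch. */
static void propagate_pass(const float *src, float *dst, float *accum) {
    for (int z = 0; z < N; z++)
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++) {
                float s = at(src, x - 1, y, z) + at(src, x + 1, y, z)
                        + at(src, x, y - 1, z) + at(src, x, y + 1, z)
                        + at(src, x, y, z - 1) + at(src, x, y, z + 1);
                dst[idx(x, y, z)] = s;
                accum[idx(x, y, z)] += s;
            }
}

/* Runs ITERATIONS passes, swapping source and target after each complete pass.
   Returns whichever buffer holds the result of the last pass. */
static float *run(float *src, float *dst, float *accum) {
    for (int i = 0; i < ITERATIONS; i++) {
        propagate_pass(src, dst, accum);
        float *tmp = src; src = dst; dst = tmp;
    }
    return src;
}
```

Starting from a single 1.0 in the centre voxel, no value reaches the border within 8 passes, so every pass multiplies the total by 6 and the final buffer sums to 6^8 = 1679616.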

Accepted answer by jozxyqk:

As you say, not everything fits in the active threads at once, so I don't believe this is directly possible unless you accept some error: some of the values you read may come from before the update and some from after it. In other words, all threads must finish the first ping before any of them moves on to the pong. GLSL's barrier() only synchronizes invocations within a single work group, and because only a portion of the threads physically execute at the same time, putting the passes in a loop inside the shader won't work.
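The kind of error described above can be reproduced deterministically on the CPU. The sketch below is a 1D stand-in (8 cells, 4 cells per simulated work group, 2 iterations are all assumptions) and simulates one execution order the hardware is free to choose for a single dispatch: each work group runs all of its iterations before the next group starts. The function names are hypothetical.

```c
#include <assert.h>

#define LEN 8      /* 1D stand-in for the 32x32x32 volume (assumption) */
#define GROUP 4    /* cells per simulated work group (assumption) */
#define ITERS 2    /* ping-pong iterations inside the shader loop */

/* Sum of the two neighbours of cell i in src; cells outside the line read as zero. */
static float neighbours(const float *src, int i) {
    float l = i > 0 ? src[i - 1] : 0.0f;
    float r = i < LEN - 1 ? src[i + 1] : 0.0f;
    return l + r;
}

/* What the algorithm needs: each pass covers every cell before the swap. */
static void synchronized_run(float *buf0, float *buf1) {
    float *src = buf0, *dst = buf1;
    for (int it = 0; it < ITERS; it++) {
        for (int i = 0; i < LEN; i++) dst[i] = neighbours(src, i);
        float *tmp = src; src = dst; dst = tmp;
    }
}

/* One schedule allowed for a single dispatch with the loop inside the shader:
   group g runs all of its iterations (with barrier() between them) before
   group g + 1 starts, since nothing orders separate work groups. */
static void unsynchronized_run(float *buf0, float *buf1) {
    assert(LEN % GROUP == 0);
    for (int g = 0; g < LEN; g += GROUP) {
        float *src = buf0, *dst = buf1;
        for (int it = 0; it < ITERS; it++) {
            for (int i = g; i < g + GROUP; i++) dst[i] = neighbours(src, i);
            float *tmp = src; src = dst; dst = tmp;
        }
    }
}
```

With a single 1.0 at cell 3, the synchronized run ends with {0, 1, 0, 2, 0, 1, 0, 0}, while this schedule produces 1 instead of 2 at cell 3: group 0's second pass read cell 4 before group 1 had written it.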

I can think of two approaches:

  1. Break the problem up into tiles that fit in a single work group. The catch is that there is then no communication across tile edges (neighbours in other tiles may be stale) until the kernel/dispatch finishes.
  2. Implement your own scheduling: using atomic ops, have the invocations fetch tasks until a complete ping has been done (implying a manual sync), and only then move on to the pong after a memoryBarrier(). From experience, this will probably be a lot slower than simply putting glDispatchCompute in a for loop on the CPU side, with a glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT) call between the dispatches.
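The first approach, tiles with stale edges, can be sketched on the CPU as well. This is a 1D illustration under assumed sizes (8 cells, tiles of 4, 2 iterations per dispatch); the local `ping`/`pong` arrays stand in for shared memory, and `tiled_dispatch` is a hypothetical name. Each tile loads its cells plus a one-cell halo, iterates with only an in-tile barrier, and never refreshes the halo, so edge cells see their neighbours' values from the start of the dispatch.

```c
#include <assert.h>

#define LEN 8    /* 1D stand-in for the volume (assumption) */
#define TILE 4   /* cells per tile, i.e. per work group (assumption) */
#define ITERS 2  /* ping-pong iterations done inside one dispatch */

/* One dispatch of the tiled scheme: reads `in`, writes the tile results to `out`
   and adds every pass's values into `accum`. Halo cells stay frozen. */
static void tiled_dispatch(const float *in, float *out, float *accum) {
    assert(LEN % TILE == 0);
    for (int t = 0; t < LEN; t += TILE) {
        float ping[TILE + 2], pong[TILE + 2];
        for (int j = 0; j < TILE + 2; j++) {
            int i = t + j - 1;  /* global index, including the halo cells */
            ping[j] = (i >= 0 && i < LEN) ? in[i] : 0.0f;
        }
        /* The halo is copied once and never updated during this dispatch. */
        pong[0] = ping[0];
        pong[TILE + 1] = ping[TILE + 1];
        float *src = ping, *dst = pong;
        for (int it = 0; it < ITERS; it++) {
            for (int j = 1; j <= TILE; j++) {
                dst[j] = src[j - 1] + src[j + 1];
                accum[t + j - 1] += dst[j];
            }
            float *tmp = src; src = dst; dst = tmp;  /* in-tile barrier + swap */
        }
        for (int j = 1; j <= TILE; j++) out[t + j - 1] = src[j];
    }
}
```

With a single 1.0 at cell 3, one tiled dispatch yields {0, 1, 0, 1, 1, 1, 0, 0} instead of the fully synchronized {0, 1, 0, 2, 0, 1, 0, 0}: both cells next to the tile boundary are off, because each tile only sees the other tile's edge as it was when the dispatch started.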