I am trying to implement a propagation scheme for a 32x32x32 3D texture with a GLSL compute shader. It would be very nice if I could run x iterations with just one execution of the shader.
I have 3 textures: one is the source, one is the target, and the third accumulates everything. The source and target have to be swapped for each iteration. Pseudocode would look like this.
OpenGL:
glUseProgram(computeShaderId);
glBindImageTexture(0, srcTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glBindImageTexture(1, targetTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glBindImageTexture(2, accumulateTexId, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glDispatchCompute(32,32,32);
GLSL:
#version 430
layout (local_size_x = 1, local_size_y = 1, local_size_z =1) in;
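// binding points match the image units used in glBindImageTexture
layout(rgba32f, binding = 0) uniform image3D srcTex;
layout(rgba32f, binding = 1) uniform image3D targetTex;
layout(rgba32f, binding = 2) uniform image3D accumulateTex;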
void main() {
    ivec3 currentPos = ivec3(gl_GlobalInvocationID.xyz);
    for (int i = 0; i < 8; i++) {
        // accumulate the values of the 6 neighbours (top, bottom, left, right, front, back)
        // by using the current source texture
        // this involves imageLoad
        vec4 neighbourValues = getValuesFrom6Neighbours(currentPos, currentSource);
        imageStore(currentTarget, currentPos, neighbourValues);
        vec4 value = imageLoad(accumulateTex, currentPos);
        imageStore(accumulateTex, currentPos, neighbourValues + value);
        // the textures are swapped, which I have a solution for, so no problem here
        swapSrcAndTarget();
        // here is the problem: how to synchronize all the different shader invocations?
        someKindOfBarrier();
    }
}
The thing is that I cannot do all of this in one work group because of the size of the texture. If it fit in one work group, I could just use barrier() and it would be fine. Because the textures are swapped, all values have to be updated before they are read again in the next iteration. Does anyone have an idea whether this is somehow possible?
Thank you, Marc
Exactly as you say, not everything fits in the active threads at once, so I don't believe this is directly possible unless you accept some error (any value you read may come from either before or after the update). In other words, all threads must finish the first ping before any of them moves on to the pong. Since only a portion of the threads is physically executing at any one time, putting the passes in a loop inside the shader won't work.
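To make the ping/pong requirement concrete, here is a minimal CPU sketch in C (not GPU code). Everything in it is made up for illustration: the names N, idx, fetch, sum6, propagate_pass and propagate are hypothetical, the grid is shrunk to 8x8x8, a single float stands in for the RGBA value, and cells outside the grid are assumed to read as 0, which the question does not specify. The point is only that each pass writes the whole target before the buffers are swapped.

```c
#include <assert.h>
#include <stddef.h>

#define N 8                 /* hypothetical grid edge; the question uses 32 */
#define CELLS (N * N * N)

/* Flatten a 3D coordinate into an index into a CELLS-sized array. */
static size_t idx(int x, int y, int z) {
    return ((size_t)z * N + (size_t)y) * N + (size_t)x;
}

/* Read one cell; anything outside the grid counts as 0 (an assumption). */
static float fetch(const float *g, int x, int y, int z) {
    if (x < 0 || y < 0 || z < 0 || x >= N || y >= N || z >= N) return 0.0f;
    return g[idx(x, y, z)];
}

/* Sum of the 6 face neighbours of (x, y, z). */
static float sum6(const float *g, int x, int y, int z) {
    return fetch(g, x - 1, y, z) + fetch(g, x + 1, y, z)
         + fetch(g, x, y - 1, z) + fetch(g, x, y + 1, z)
         + fetch(g, x, y, z - 1) + fetch(g, x, y, z + 1);
}

/* One complete pass: reads only src, writes every cell of dst, adds to accum.
   Nothing reads dst until this function has returned. */
static void propagate_pass(const float *src, float *dst, float *accum) {
    for (int z = 0; z < N; z++)
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++) {
                float v = sum6(src, x, y, z);
                dst[idx(x, y, z)] = v;
                accum[idx(x, y, z)] += v;
            }
}

/* Ping-pong between a and b; returns the buffer holding the latest result. */
static float *propagate(float *a, float *b, float *accum, int iterations) {
    assert(iterations >= 0);
    float *src = a, *dst = b;
    for (int i = 0; i < iterations; i++) {
        propagate_pass(src, dst, accum);  /* the whole "ping" finishes here */
        float *tmp = src;                 /* only then swap source and target */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

On the CPU the end of propagate_pass is the synchronization point for free. In the shader loop from the question, that is exactly the guarantee someKindOfBarrier() would have to provide across all work groups.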
I can think of two things.