MTLBuffers returning "shifted" values in kernel shader using structs, works on simulator, not on device


I'm writing an iOS app (currently running on an iPad) that uses Metal compute shaders for some heavy math processing. Things had been working well, but in recent testing I've run into issues where the physical device shows unexpected behavior that the simulator does not.

To debug the problem I added various print statements, and I found that many of the values in the returned arrays of structs are offset. A typical struct in the output buffer has 20-25 float values; they are only grouped into a struct to simplify moving the data around without having to track individual indices.

The structs are declared identically in Swift and Metal, with the floats in the same order. What I'm finding is that values come out shifted, i.e. assigned to a different variable within the struct than I expect. With the example code below, I might receive the value of valueFive in valueSeven, and so on. The same thing happens when I use more than one return buffer: if I return three buffers of custom objects, I see the issue again.
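One thing that can be checked to rule out a plain layout mismatch is the byte offset of each field on the Swift side, compared against the offsets the Metal compiler assigns (visible in a GPU frame capture). A minimal sketch, with Probe standing in for the real struct:

```swift
// Sketch: print the byte offsets Swift actually uses for each field, to
// compare against the offsets the Metal compiler assigns. `Probe` is a
// stand-in for the real struct; a struct of plain Floats packs tightly.
struct Probe {
    var valueOne: Float = 0
    var valueTwo: Float = 0
    var valueThree: Float = 0
}

let offsets = [
    MemoryLayout<Probe>.offset(of: \Probe.valueOne)!,
    MemoryLayout<Probe>.offset(of: \Probe.valueTwo)!,
    MemoryLayout<Probe>.offset(of: \Probe.valueThree)!,
]
print(offsets)                                               // [0, 4, 8]
print(MemoryLayout<Probe>.size, MemoryLayout<Probe>.stride)  // 12 12
```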

I thought I might be causing this by not assigning every value inside the kernel shader initially, so I corrected that and made sure every value within each struct is set for each index.

I also compared using stride vs. size when computing the length for the makeBuffer call. When I debug these, the two values are identical, so I don't think that is the problem.
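(For a struct made only of Floats, size and stride do agree; they diverge once alignment padding appears. A hypothetical example, not one of my real types:)

```swift
// Hypothetical struct where size and stride differ: the trailing Bool ends
// at byte 13, but the 4-byte alignment of Float pads the stride out to 16.
struct Padded {
    var a: Float = 0        // offset 0
    var b: Float = 0        // offset 4
    var c: Float = 0        // offset 8
    var flag: Bool = false  // offset 12
}

print(MemoryLayout<Padded>.size)    // 13
print(MemoryLayout<Padded>.stride)  // 16
```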

I'm trying to figure out what I could be doing wrong, or whether someone else has run into a similar issue. One thing that fixed some of this was avoiding nested structs. I had a struct containing 4 structs, each of which contained a varying number (4-7) of float values; I had only combined them for simplicity of moving data around. The numbers came out jumbled on the iPad, and the "fix" for this particular problem was returning each struct in its own buffer rather than using a single buffer with the combined object.
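The nested layout that misbehaved looked roughly like the sketch below (field counts are illustrative). On the Swift side the offsets are what I expect; my understanding is that trouble starts if the MSL side uses float2/float3/float4 members, which carry 8- or 16-byte alignment and introduce padding that plain Swift Floats don't have:

```swift
// Sketch of the nested-struct layout. Three plain Floats pack to 12 bytes,
// so `second` starts at offset 12 on the Swift side; an MSL struct of plain
// floats matches this, but a float3 member would be 16-byte aligned instead.
struct Inner {
    var x: Float = 0
    var y: Float = 0
    var z: Float = 0
}

struct Outer {
    var first = Inner()
    var second = Inner()
}

print(MemoryLayout<Inner>.stride)                      // 12
print(MemoryLayout<Outer>.offset(of: \Outer.second)!)  // 12
```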

Notably, this also happens when the function is effectively single-threaded. There are many steps to process in the kernel shader, but I only use a grid width of 1 since there is only one index to calculate.

Here's a sample of an object and the associated Swift-side function that runs the shader:

(Note: the threads definition is more general than needed for a single-item dispatch; I hope that isn't causing the problem.)

Swift Code:

struct MyCustomObject: Codable {
    var valueOne:Float = 0
    var valueTwo:Float = 0
    var valueThree:Float = 0
    var valueFour:Float = 0
    var valueFive:Float = 0
    var valueSix:Float = 0
    var valueSeven:Float = 0
    var valueEight:Float = 0
    var valueNine:Float = 0
    var valueTen:Float = 0
    var valueEleven:Float = 0
    var valueTwelve:Float = 0
    var valueThirteen:Float = 0
    var valueFourteen:Float = 0
    var valueFifteen:Float = 0
    var valueSixteen:Float = 0
    var valueSeventeen:Float = 0
    var valueEighteen:Float = 0
    var valueNineteen:Float = 0
    var valueTwenty:Float = 0
}

func processWithMetal(inputArrayOne: [Float], inputArrayTwo: [Float]) -> MyCustomObject {
    
    let commandQueue = device?.makeCommandQueue()
    let gpuFunctionLibrary = device?.makeDefaultLibrary()
    let additionalGPUFunction = gpuFunctionLibrary?.makeFunction(name: "myShaderFunction")
    
    var additionComputePipelineState: MTLComputePipelineState!
    do {
        additionComputePipelineState = try device?.makeComputePipelineState(function: additionalGPUFunction!)
    } catch {
        print(error)
    }
    
    // Input Buffers (RO)
    let inputArrayOneBuff = device?.makeBuffer(bytes: inputArrayOne,
                                      length: MemoryLayout<Float>.stride * inputArrayOne.count,
                                      options: .storageModeShared)
    let inputArrayTwoBuff = device?.makeBuffer(bytes: inputArrayTwo,
                                      length: MemoryLayout<Float>.stride * inputArrayTwo.count,
                                      options: .storageModeShared)
    
    // Output Buffers (RW)
    let myObjectOutputBuff = device?.makeBuffer(length: MemoryLayout<MyCustomObject>.stride * 1,
                                        options: .storageModeShared)
    
    let commandBuffer = commandQueue?.makeCommandBuffer()
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(additionComputePipelineState)
    
    // Input Buffers
    commandEncoder?.setBuffer(inputArrayOneBuff, offset: 0, index: 0)
    commandEncoder?.setBuffer(inputArrayTwoBuff, offset: 0, index: 1)
    
    // Output Buffers
    commandEncoder?.setBuffer(myObjectOutputBuff, offset: 0, index: 2)
    
    // Single-item grid: one threadgroup of threadExecutionWidth threads
    let threadGroupWidth = additionComputePipelineState.threadExecutionWidth

    let threadgroupsPerGrid = MTLSize(width: (1 + threadGroupWidth - 1) / threadGroupWidth, height: 1, depth: 1)
    let threadsPerThreadgroup = MTLSize(width: threadGroupWidth, height: 1, depth: 1)
    
    commandEncoder?.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder?.endEncoding()
    
    // Submits the command buffer, executes the function
    commandBuffer?.commit()
    commandBuffer?.waitUntilCompleted()
    
    // Note: capacity/count are element counts, not byte counts
    let MyCustomObjectBufferPointer = myObjectOutputBuff?.contents().bindMemory(to: MyCustomObject.self, capacity: 1)
    let MyCustomObjectResult = UnsafeBufferPointer<MyCustomObject>(start: MyCustomObjectBufferPointer, count: 1)
    
    return MyCustomObjectResult[0]
}
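Aside: on devices that support non-uniform threadgroup sizes (roughly A11 and later, as I understand it), the whole rounding calculation can be replaced with dispatchThreads, which handles partial threadgroups itself. A fragment, reusing commandEncoder from above:

```swift
// Fragment (assumes the device supports non-uniform threadgroup sizes):
// dispatch exactly one thread and let Metal handle the partial threadgroup.
let gridSize = MTLSize(width: 1, height: 1, depth: 1)
let groupSize = MTLSize(width: 1, height: 1, depth: 1)
commandEncoder?.dispatchThreads(gridSize, threadsPerThreadgroup: groupSize)
```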

Metal:

struct MyCustomObject {
    float valueOne;
    float valueTwo;
    float valueThree;
    float valueFour;
    float valueFive;
    float valueSix;
    float valueSeven;
    float valueEight;
    float valueNine;
    float valueTen;
    float valueEleven;
    float valueTwelve;
    float valueThirteen;
    float valueFourteen;
    float valueFifteen;
    float valueSixteen;
    float valueSeventeen;
    float valueEighteen;
    float valueNineteen;
    float valueTwenty;
};

kernel void myShaderFunction(
      // Input Arrays
      constant float *inputArrayOne                  [[ buffer(0) ]],
      constant float *inputArrayTwo                  [[ buffer(1) ]],
      
      // Output Object
      device MyCustomObject *myObjectOut             [[ buffer(2) ]],
      uint   index [[ thread_position_in_grid ]]) {
          
          // Various processing..
          
          // Set output values:
          myObjectOut[index].valueOne = 1;
          myObjectOut[index].valueTwo = 2;
          // ...
          myObjectOut[index].valueTwenty = 20;
}