According to the official documentation:
AtomicCopyBufferUINT and AtomicCopyBufferUINT64 enable late-latch to reduce perceived latency.
I'm having trouble understanding how this could be reliably used for late latching. I understand that the UINT that it copies is like an index or an offset value, and the dependent resource will be "completely updated" (which probably means it syncs with CopyResource or something?).
But index or offset into what?
Scenario A: index or offset into a ring buffer, and the UINT always points to the newest updated buffer
In this case, what happens if the CPU updates the buffer really fast? A possible failing scenario: the ring buffer has a buffer count of 3, the GPU renders at 60 fps, and mouse input updates at 1000 Hz. Mouse updates easily outpace GPU rendering, so the buffer that the GPU is currently reading from gets overwritten by the CPU. So this clearly isn't reliable.
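To make the failure concrete, here is a minimal sketch in plain C++ (a toy model, not D3D12 code). The function name writesUntilOverwrite and the model itself are my own assumptions: the GPU latches one slot for the whole frame, and the CPU writes round-robin into the next slot on every update.

```cpp
#include <cassert>

// Toy model (assumption): the GPU latches slot 0 at frame start and reads it
// for the whole frame, while the CPU writes to the next slot on each update.
// Returns the number of CPU writes after which the latched slot is overwritten,
// or -1 if it survives the whole frame.
int writesUntilOverwrite(int slotCount, int writesPerFrame) {
    assert(slotCount > 0 && writesPerFrame >= 0);
    const int latched = 0;   // slot the GPU is reading
    int cpuSlot = latched;   // last slot the CPU wrote
    for (int w = 1; w <= writesPerFrame; ++w) {
        cpuSlot = (cpuSlot + 1) % slotCount;   // round-robin advance
        if (cpuSlot == latched) return w;      // CPU wrapped onto the GPU's slot
    }
    return -1;
}
```

With 3 slots and roughly 16 mouse updates per 60 fps frame, the third write already lands on the slot the GPU is reading.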
Scenario B: same as A, but pause the update if the ring buffer is full until GPU finishes one frame
In this case, the pausing defeats the purpose of late latching. Taking the same scenario from A, for each new frame the CPU update pauses after 2 more updates, then waits pointlessly for 16ms - 2ms = 14ms. That's 14ms of extra input delay, where it ideally should have been only 1-2ms with late latching.
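The same arithmetic as a small C++ sketch, in case it helps to plug in other numbers. The function name stallMsPerFrame is hypothetical, and it assumes that one slot is held by the GPU for the whole frame while the CPU fills the remaining slots at a fixed interval and then blocks.

```cpp
#include <cassert>

// Milliseconds the CPU producer sits blocked per frame once the ring is full.
// Assumption: one slot is held by the GPU for the entire frame, and the CPU
// fills the remaining (slotCount - 1) slots at a fixed update interval.
double stallMsPerFrame(int slotCount, double frameMs, double updateIntervalMs) {
    assert(slotCount >= 1);
    const double fillMs = (slotCount - 1) * updateIntervalMs; // time to fill free slots
    const double stall = frameMs - fillMs;
    return stall > 0.0 ? stall : 0.0;                         // no stall if the ring never fills
}
```

For 3 slots, a 16ms frame, and 1ms between updates, this gives the 14ms stall described above.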
Scenario C: same as A, but increase the size of the ring buffer, like 100 times
This can lead to enormous VRAM usage, and it would be impossible to tune for every hardware configuration. It also doesn't technically solve the problems in scenarios A and B. What if, in the future, some hardware gets fast enough that 100 buffers fill up before the end of a frame?
Scenario D: same as A, but avoid updating the buffer that the GPU is currently using, update another one instead
This is what I had in mind at first, but I quickly realized that there is no guarantee that the CPU or the GPU can avoid using a buffer that the other is using, because there is no interlock method between the CPU and the GPU in the D3D12 API, unless you count ID3D12Fence.
Scenario E: implement the idea of D using ID3D12Fence
Before and after ExecuteCommandLists, the command queue uses the Signal and Wait functions with an ID3D12Fence interface to sync with a 3rd CPU thread to achieve interlock. Within that thread, a buffer will be "locked" before ExecuteCommandLists and "unlocked" after. In this way, the CPU can know which buffer is locked (being used) thanks to the 3rd CPU thread. But if we go with this design, then what's the point in having AtomicCopyBufferUINT* at all? The CPU can update the index/offset value here just fine; there's no risk of losing atomicity because there's an interlock.
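Here is roughly what I mean by the interlock, as a CPU-only C++ sketch. Nothing here is a D3D12 call: SlotInterlock, publish, lockNewest, unlock, and frameIsSafe are hypothetical names, and the real ExecuteCommandLists and fence wait are replaced by plain function calls.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical bookkeeping the "3rd CPU thread" would own.
class SlotInterlock {
public:
    explicit SlotInterlock(int slotCount) : inUse_(slotCount, false) {
        assert(slotCount >= 2);
    }

    // CPU producer: write into the next slot that is not locked.
    // Returns the slot written, or -1 if every slot is locked.
    int publish() {
        std::lock_guard<std::mutex> guard(m_);
        const int n = static_cast<int>(inUse_.size());
        for (int step = 1; step <= n; ++step) {
            const int slot = (newest_ + step) % n;
            if (!inUse_[slot]) { newest_ = slot; return slot; }
        }
        return -1;
    }

    // Stand-in for "just before ExecuteCommandLists": lock the newest slot.
    int lockNewest() {
        std::lock_guard<std::mutex> guard(m_);
        inUse_[newest_] = true;
        return newest_;
    }

    // Stand-in for "the fence reports the GPU is done with this slot".
    void unlock(int slot) {
        std::lock_guard<std::mutex> guard(m_);
        inUse_[slot] = false;
    }

private:
    std::mutex m_;
    std::vector<bool> inUse_;
    int newest_ = 0;   // most recently written slot
};

// Simulates one frame in which the CPU publishes `updates` times.
// Returns true if the CPU never wrote the slot locked for the GPU.
bool frameIsSafe(SlotInterlock& s, int updates) {
    const int locked = s.lockNewest();
    bool safe = true;
    for (int i = 0; i < updates; ++i) {
        if (s.publish() == locked) safe = false;
    }
    s.unlock(locked);
    return safe;
}
```

In this model the slot is picked on the CPU at submission time, which is exactly why I don't see where AtomicCopyBufferUINT* would fit in.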
I feel like a lot of people are going to point out deficiencies in my solution E, but I have no clue what they would be.
I've googled everywhere, and no project or code sample has ever used AtomicCopyBufferUINT*. The projects that use late latching I found are all DirectX 11 projects using third-party late-latching libraries from AMD or Nvidia.
So to summarize my question: is my implementation idea in scenario E correct? If not, how should I do late latching? Is AtomicCopyBufferUINT* really useless, or do I just not know how to use it?