I am writing a series of tests for a GPU's DRAM (global) memory, specifically targeting the AMD GCN architecture in the Tahiti and Hawaii model lines. These architectures have a write-back L2 cache.
What I want is to ensure that stores to global memory are actually written out to global memory before another work-item performs a read.
The barrier and mem_fence documentation in the spec states:
CLK_GLOBAL_MEM_FENCE
- The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data.
However, this only enforces correct ordering. My question is: does this trigger a write-back of the L2 cache data to global memory?
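To make the scenario concrete, here is a rough sketch of the kind of test kernel I have in mind (kernel and buffer names are just placeholders):

    // One work-item stores to global memory, the barrier with a global fence
    // is issued, and the value is read back. Within the work-group the barrier
    // guarantees the read sees the store; whether the L2 line has reached DRAM
    // for readers outside the group is exactly the open question.
    __kernel void wg_visibility_test(__global int *data, __global int *result)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        if (lid == 0)
            data[get_group_id(0)] = (int)gid;    // store to global memory

        barrier(CLK_GLOBAL_MEM_FENCE);           // orders the store for this work-group

        result[gid] = data[get_group_id(0)];     // read back the updated value
    }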
OpenCL 1.2 gives you next to no control over this. The fences are loosely defined and, if you read the spec carefully, technically only affect ordering within the work-group. So most likely nothing will force the cache to flush until the kernel completes.
OpenCL 2.0 gives you full ordering control, but ordering is all you get: there are no explicit cache operations.
If you do a release write at all_svm_devices scope, then by the time a work-item on a different device can see that write, it knows that every write before it must be visible too. In practice this may mean the cache has been flushed, if the cache is not participating in a standard ownership-based coherence protocol.
If you release at device scope only, and the L2 is shared across the whole device, there would be no need to flush it to guarantee that ordering.
The memory model is defined entirely in terms of ordering, not in terms of caches, but the scope qualifiers are intended to allow efficient implementation on very relaxed cache hierarchies.
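For illustration, a hedged sketch of that release/acquire pattern in OpenCL 2.0 C (the flag/payload layout is hypothetical, and using all_svm_devices scope across devices additionally assumes fine-grained SVM atomic support):

    // The producer publishes a payload with a release store; once a consumer's
    // acquire load observes the flag, the payload write must be visible too,
    // whatever cache maintenance the hardware needs to make that true.
    __kernel void producer(__global int *payload, __global atomic_int *flag)
    {
        if (get_global_id(0) == 0) {
            payload[0] = 42;                     // plain global write
            atomic_store_explicit(flag, 1,
                                  memory_order_release,
                                  memory_scope_all_svm_devices);
        }
    }

    __kernel void consumer(__global int *payload, __global atomic_int *flag,
                           __global int *out)
    {
        if (get_global_id(0) == 0) {
            // Acquire load pairs with the release store above.
            while (atomic_load_explicit(flag,
                                        memory_order_acquire,
                                        memory_scope_all_svm_devices) == 0)
                ;
            out[0] = payload[0];                 // guaranteed to observe 42
        }
    }
    // If the producer and consumer run on the same device, memory_scope_device
    // is sufficient, and with a device-wide shared L2 no flush should be needed.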