Cache coherency deals with read/write ordering for a single memory location in the presence of caches, while memory consistency is about the ordering of accesses across all locations, with or without caches.
Normally, processors/compilers guarantee only weak memory ordering, requiring the programmer to insert appropriate synchronisation events (fences, acquire/release operations) to ensure memory consistency.
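To make the kind of synchronisation event I mean concrete, here is a minimal C11 sketch (my own illustration, not taken from any particular source): a release store publishes data and an acquire load consumes it, and these two points are exactly where the compiler/CPU must insert the required barriers on a weakly ordered machine.

```c
/* Minimal C11 sketch of programmer-inserted synchronisation events:
 * a release store publishes data, an acquire load consumes it.
 * Compile with e.g.: cc -std=c11 -pthread example.c                        */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int payload;              /* ordinary (non-atomic) data              */
static atomic_int ready = 0;     /* flag used as the synchronisation event  */

static int producer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* "release"   */
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* "acquire"   */
    printf("payload = %d\n", payload);   /* guaranteed to see 42            */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```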
My question is: if programmers have to insert these events anyway, is it possible to achieve memory consistency without cache coherency in cache-based processors? What are the trade-offs? As far as I know, GPUs don't keep their caches coherent, so it should indeed be possible.
My intuition is that synchronisation events would become terribly slow without cache coherency, because whole caches might need to be invalidated/flushed at syncs, instead of specific lines getting flushed/invalidated continuously in the background through the coherency machinery. But I could not find any material discussing these trade-offs (and ChatGPT was no help either ;)).
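To spell out that intuition, here is a purely hypothetical sketch of the same release/acquire pair on a machine with caches but no hardware coherency. `cache_writeback_all()` and `cache_invalidate_all()` are invented placeholders for whatever cache-maintenance instructions such a machine would provide; the point is only that the whole-cache maintenance cost lands on every synchronisation event.

```c
/* Hypothetical release/acquire on a cached but NON-coherent machine.
 * cache_writeback_all() / cache_invalidate_all() are invented stand-ins for
 * the machine's cache-maintenance instructions (real ones might be
 * privileged, line-granular, etc.).                                         */
#include <stdatomic.h>

static int payload;
static _Atomic int ready;        /* assumed to live in uncached (or otherwise
                                    specially handled) storage, so the spin
                                    below can actually observe the update    */

/* No-op stubs standing in for cache-maintenance instructions. */
static void cache_writeback_all(void)  { /* write all dirty lines to memory */ }
static void cache_invalidate_all(void) { /* drop all possibly stale lines   */ }

void release_publish(int value)
{
    payload = value;             /* may sit in this core's dirty cache line */
    cache_writeback_all();       /* push EVERYTHING to memory: the big cost */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int acquire_consume(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    cache_invalidate_all();      /* throw away EVERYTHING possibly stale    */
    return payload;              /* re-fetched from memory                  */
}
```

With coherency, the hardware would instead migrate or invalidate just the affected lines in the background; without it, every sync pays for the whole cache (or for explicitly tracking which lines to maintain).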
There is a section on this in the book "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh and Gupta (perhaps a bit outdated by now).
The C/C++-based languages mentioned above are:
- Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language
- Implementing a parallel C++ runtime system for scalable parallel systems
- Parallel programming in Split-C
These look like predecessors of CUDA. So a lack of coherency perhaps makes sense for massively parallel workloads, where the relatively slow synchronisations (due to the missing coherency) still account for only a tiny fraction of the overall runtime.
The Cray T3D and T3E indeed had a shared address space without hardware-supported cache coherency.
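For what it's worth, that shared address space was driven by explicit one-sided puts/gets plus barriers (Cray SHMEM, the ancestor of today's OpenSHMEM), so it was the programmer who made remote data visible rather than a coherency protocol. A rough sketch using the modern OpenSHMEM API (the original Cray SHMEM call names differed slightly):

```c
/* Rough OpenSHMEM-style sketch of the T3D/T3E programming model: a shared
 * address space accessed via explicit puts, with visibility enforced by the
 * programmer through barriers rather than by hardware coherency.            */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    static long data = 0;   /* "symmetric": same address on every PE        */

    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == 0 && npes > 1) {
        long value = 42;
        /* One-sided write of `value` into PE 1's copy of `data`. */
        shmem_long_put(&data, &value, 1, 1);
    }

    /* The sync event: completes outstanding puts and synchronises all PEs. */
    shmem_barrier_all();

    if (me == 1)
        printf("PE %d sees data = %ld\n", me, data);

    shmem_finalize();
    return 0;
}
```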