We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, as there is no search over multiple ways for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped caches.
I read that modern processors try to combine the benefits of both by using a technique called way prediction, where they predict the line of the given set in which the hit is most likely to happen and search only that line. If that attempt results in a miss, they fall back to the normal set-associative search across all the cache lines of the set.
I want to understand how this way prediction works. How is the latency of the prediction hardware/logic smaller than the search latency of the complete set?
The way prediction mechanisms for AMD's Bulldozer and Ryzen families are µtag-based and are documented in "Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors" (Moritz Lipp et al., 2020, PDF).
µtag-based way prediction matches a hash of the virtual address rather than the full virtual address, so not only does it avoid the address translation overhead (like a virtually tagged cache), but the smaller storage also lets the prediction array be accessed with lower latency and the tag be checked with slightly lower latency. "Take A Way" reverse engineered that both AMD's Bulldozer family and Ryzen family use bits 12 through 27 for the hash function and that a single xor (⊕) layer is used, which reduces latency. The Bulldozer family used 12⊕21, 13⊕22, 14⊕23, 15⊕24, 16⊕25, 17⊕26, 18⊕27; the Ryzen family used 12⊕27, 13⊕26, 14⊕25, 15⊕20, 16⊕21, 17⊕22, 18⊕23, 19⊕24.
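For concreteness, here is a minimal C sketch of those two hash functions using the bit pairs reverse engineered in the paper; the function and helper names are mine, and `bit(va, n)` simply extracts virtual address bit n.

```c
#include <stdint.h>

/* Extract bit n of a virtual address. */
static inline unsigned bit(uint64_t va, unsigned n) { return (unsigned)((va >> n) & 1); }

/* Bulldozer family: 7-bit µtag, one xor per bit, pairs 12⊕21 ... 18⊕27. */
static unsigned utag_bulldozer(uint64_t va)
{
    unsigned tag = 0;
    for (unsigned i = 0; i < 7; i++)
        tag |= (bit(va, 12 + i) ^ bit(va, 21 + i)) << i;
    return tag;
}

/* Ryzen family: 8-bit µtag; the pairs do not follow a single stride, so
 * they are listed explicitly (12⊕27, 13⊕26, ..., 19⊕24). */
static unsigned utag_ryzen(uint64_t va)
{
    static const unsigned pairs[8][2] = {
        {12, 27}, {13, 26}, {14, 25}, {15, 20},
        {16, 21}, {17, 22}, {18, 23}, {19, 24},
    };
    unsigned tag = 0;
    for (unsigned i = 0; i < 8; i++)
        tag |= (bit(va, pairs[i][0]) ^ bit(va, pairs[i][1])) << i;
    return tag;
}
```

Each µtag bit is only one xor gate deep, which is why the hash adds so little latency.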
Two aspects of these µtag hash functions are worth noting. First, by using less significant bits rather than the full 48 valid virtual address bits, all of the bits used in the hash function are available earlier because of reduced carry propagation delay: address generation involves an addition, and although high-performance adders have O(log n) delay, the less significant result bits are still available earlier. (This effect also means that the twelve least significant bits, which determine the cache set, are available even earlier, so the predictor table can be indexed before the µtag has been computed.) Second, in the Ryzen family, the typically least variable (most significant) bits are xored with the typically most variable (least significant) bits for three bits of the hash; this should reduce the probability of false matches. False matches are handled by replacing the matching way rather than using the ordinary (LRU-oriented) replacement policy, which will usually result in a higher miss rate.
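To make that ordering concrete, a simplified model of the lookup flow might look like the sketch below. The geometry (32 KiB, 8-way, 64 B lines, so 64 sets indexed by VA bits [6:11]) and the table layout are illustrative assumptions rather than AMD's documented implementation, and `utag_ryzen` is the hash from the previous sketch.

```c
#include <stdint.h>

unsigned utag_ryzen(uint64_t va);     /* the Ryzen hash from the sketch above */

/* Illustrative geometry: 32 KiB, 8-way, 64 B lines -> 64 sets (VA bits [6:11]). */
enum { WAYS = 8, SETS = 64, LINE = 64 };

static unsigned way_utag[SETS][WAYS]; /* µtag remembered per (set, way);
                                       * a real design also tracks validity,
                                       * which is omitted here */

/* Returns the predicted way, or -1 if no µtag matches (treated as a miss). */
static int predict_way(uint64_t va)
{
    unsigned set  = (unsigned)(va / LINE) % SETS; /* low bits: available early  */
    unsigned utag = utag_ryzen(va);               /* hash: available a bit later */

    for (unsigned w = 0; w < WAYS; w++)
        if (way_utag[set][w] == utag)
            return (int)w;       /* only this way's tag and data are read */
    return -1;                   /* fall back to a full lookup / fill */
}

/* On a fill, or on a false match (µtag match but physical tag mismatch),
 * the matching/victim way's µtag is overwritten with the new line's hash. */
static void train_way(uint64_t va, unsigned way)
{
    way_utag[(unsigned)(va / LINE) % SETS][way] = utag_ryzen(va);
}
```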
(Recent Intel x86 processors are also known to use µtag-based way prediction.)
Other Way Prediction Examples
Way prediction is not a new technique. POWER6 used a µtag predictor with the 11-bit tags being [14:17].([16:23]⊕[24:31]) for a 64 KiB 8-way cache with 128 B cache lines ("IBM POWER6 microarchitecture", H.Q. Le et al., 2007). One valid bit per hardware thread was also included to avoid thrashing for homonyms (effective address matches for different address spaces). As with Ryzen, there is clearly a recognition that the least significant bits vary more frequently, so they are xored with other bits rather than used directly.
The Pentium 4 also used a µtag predictor. According to "The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology" (Darrell Boggs et al., 2004), the 90nm implementation "significantly increased the size of the partial address match from previous implementations, thus reducing the number of false aliasing cases". Details appear not to have been published.
The MIPS R10000 used a simple MRU-based way predictor for its off-chip two-way associative L2 cache. 8 Ki single-bit prediction entries were provided to indicate the most recently used cache block of a set. If more than 8 Ki sets were provided (up to 128 Ki sets were supported for a 16 MiB L2 cache with 64 B blocks), different sets would use the same prediction bit (predictor aliasing). This way prediction was used to reduce pin count: only one way's tag would be read at a time, along with part of the data block from only that way. The alternatives would be a direct-mapped cache (HP PA-RISC used large off-chip, direct-mapped L1 caches) or specialized (more expensive) chips handling tag comparison (the MIPS R8000 used special tag SRAMs that included tag comparison logic and used the comparison result to address the ordinary SRAMs holding the data).
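A rough illustration of that scheme, with the array size and update policy taken from the description above (the function names are mine):

```c
#include <stdint.h>

/* Sketch of an R10000-style MRU way predictor for a two-way L2: 8 Ki
 * single-bit entries record the most recently used way of a set.  When the
 * cache has more than 8 Ki sets, several sets share one bit (aliasing). */
enum { PRED_ENTRIES = 8 * 1024 };

static uint8_t mru_way[PRED_ENTRIES];            /* 0 or 1 per entry */

static unsigned l2_predict_way(unsigned set)
{
    return mru_way[set % PRED_ENTRIES];          /* read this way's tag and data first */
}

static void l2_record_hit(unsigned set, unsigned hit_way)
{
    mru_way[set % PRED_ENTRIES] = (uint8_t)hit_way;
}
```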
The Alpha 21264 instruction cache used a set and way predictor, which could be viewed as a variation of a branch target buffer. For each aligned chunk of four 4-byte instructions, a prediction of the next line (index) and way was included. If a chunk of instructions included a branch that was taken last time it was executed, that branch's target line and way would be the prediction for that line. Control flow instructions with variable targets (including call returns) and branches that change whether they are taken or not would result in mispredictions, but this predictor's accuracy was usually high.
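A sketch of the stored state and retraining step, with illustrative (not authoritative) field widths and names:

```c
#include <stdint.h>

/* 21264-style next-fetch predictor sketch: each aligned 16-byte I-cache
 * chunk (four 4-byte instructions) carries a prediction of the line
 * (index) and way to fetch next.  Widths and names here are illustrative. */
struct next_fetch_pred {
    uint16_t next_index;   /* predicted I-cache set index of the next fetch */
    uint8_t  next_way;     /* predicted way within that set */
};

/* After a misprediction (e.g. a return, an indirect branch, or a branch
 * that changed direction), the chunk's entry is retrained with the line
 * and way that were actually fetched next. */
static void retrain(struct next_fetch_pred *p,
                    uint16_t actual_index, uint8_t actual_way)
{
    p->next_index = actual_index;
    p->next_way   = actual_way;
}
```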
Latency and Power Considerations
Modern high-performance processors primarily use way prediction to reduce access energy while maintaining fast access. With support for 32-byte cache accesses (e.g., for AVX) and fast unaligned loads (which effectively doubles the access size), the energy difference between reading eight ways of data in parallel and (usually) reading only one way of data is substantial. The savings in tag read and compare energy are somewhat reduced by the need to read and compare µtags. (Relaxing the latency constraint on the TLB can also be exploited to reduce access energy or to increase TLB capacity, since hit confirmation using physical tags and permissions can occur after the predicted way's data is already being used by the execution units.)
Direct-mapped caches gain a latency advantage from not having to select the correct way before forwarding the data to execution units. Selecting the correct way includes tag comparison and the multiplexor selection itself. However, if the way determination (or prediction) latency is less than the data access latency, the only added latency for set associativity is the pass-through latency of the "warmed-up" multiplexors. Since tag arrays are much smaller than data arrays, their access latency is much smaller, so it is easier (especially with virtual address tags) to determine the way a little before the data itself is available. (In earlier processors, smaller cache blocks — tag array size closer to data array size — and relatively lower wire delay compared to logic delay would make completing way determination before data availability more difficult and modestly increase the impact of selection delay.)