Does simultaneous multithreading make use of interleaved / temporal multithreading?


I am trying to understand simultaneous multithreading (SMT) but I have just run into a problem.

Here is what I figured out so far
SMT requires a superscalar processor to work. Both technologies - superscalar execution and SMT - allow multiple instructions to be executed at the same time. Whilst a "simple" superscalar processor requires all instructions issued within one cycle to belong to a single thread, SMT allows instructions of different threads to execute at the same time. SMT is advantageous over a plain superscalar processor because the instructions of a single thread often have dependencies, meaning that we cannot execute them all at the same time. Instructions of different threads do not have these dependencies, allowing us to execute a larger number of instructions at the same time.
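To make the dependency point concrete, here is a toy model (not real hardware) of one issue cycle: a dependency chain within one thread limits superscalar issue, while independent threads fill the slots. The 4-wide width, register names, and instruction format are all assumptions for illustration.

```python
# Toy model of one superscalar issue cycle. Not any real CPU; the 4-wide
# width, register names, and instruction encoding are made up.

WIDTH = 4  # issue slots per cycle

def issue_one_cycle(instrs, completed):
    """Fill up to WIDTH issue slots with instructions whose source
    registers were all produced in earlier cycles (are in 'completed')."""
    group = []
    for instr in instrs:
        if len(group) == WIDTH:
            break
        if set(instr["reads"]) <= completed:
            group.append(instr)
    return group

# One thread: a chain r1 -> r2 -> r3 -> r4. Each instruction needs the
# previous one's result, so only the first is ready this cycle.
single_thread = [
    {"thread": 0, "writes": "r1", "reads": []},
    {"thread": 0, "writes": "r2", "reads": ["r1"]},
    {"thread": 0, "writes": "r3", "reads": ["r2"]},
    {"thread": 0, "writes": "r4", "reads": ["r3"]},
]

# Four threads: registers are private per thread, so the first instruction
# of every thread is independent and all four can issue together.
four_threads = [{"thread": t, "writes": f"t{t}_r1", "reads": []}
                for t in range(4)]

print(len(issue_one_cycle(single_thread, completed=set())))  # 1
print(len(issue_one_cycle(four_threads, completed=set())))   # 4
```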

I hope I got that right so far.

Here is the problem I have

SMT is said to be a mix of superscalar execution and interleaved / temporal multithreading. Personally, I cannot see how interleaved multithreading is involved in SMT.

Interleaved multithreading is better than no multithreading. That's because interleaved multithreading allows a context switch when high latency events (e.g. cache misses) occur. Whilst the data is being loaded into the cache, the processor can carry on with a different thread, which increases performance.

I wonder if SMT also makes use of interleaved multithreading. Or, to put it as a question: what happens when high latency events occur in an SMT architecture?

Example of what I was thinking of

Let's assume we have got a 4-way-superscalar SMT processor and there are 5 different threads waiting to be executed. Let's also assume that the instructions of each thread are dependent on the previous instruction, so that only one instruction of each thread can be executed at a time.

If there aren't any high latency events, I figure the execution of the instructions could look something like this (each number and color corresponds to a thread):

[Image: Without cache miss]

We would just keep executing one instruction from each of the first 4 threads, using the processor ideally. Thread 5 has to wait until another thread is finished.
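The scenario above can be sketched as a tiny scheduler simulation. The instruction count per thread and the pick-the-first-four policy are made-up assumptions, chosen only to reproduce the picture I have in mind:

```python
# Hypothetical schedule: 5 threads, each limited by its dependency chain to
# one instruction per cycle, on a 4-wide SMT core. The 3 instructions per
# thread and the static thread priority are assumptions for illustration.

WIDTH = 4
remaining = {t: 3 for t in range(1, 6)}  # instructions left per thread

schedule = []
while any(remaining.values()):
    # Fill up to WIDTH slots, one instruction per runnable thread.
    runnable = [t for t, n in remaining.items() if n > 0]
    issued = runnable[:WIDTH]
    for t in issued:
        remaining[t] -= 1
    schedule.append(issued)

for c, issued in enumerate(schedule):
    print(f"cycle {c}: threads {issued}")
```

With this (simplistic) static priority, threads 1-4 share the 4 slots every cycle and thread 5 only runs once they have drained; a real core would rotate priorities more fairly.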

What I really want to figure out is what happens if a high latency event occurs. Let's assume the situation is the same, but this time thread 1 runs into a cache miss at its first instruction. What happens could look something like this:

[Image: Cache miss without multithreading]

We would have to wait until the data is loaded from memory - unless we are additionally using interleaved multithreading, such as block interleaving with switch-on-cache-miss. Then it could look like one of these:

[Image: Cache miss with multithreading]
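The switch-on-cache-miss behaviour I mean can be sketched as a toy in-order model (not any specific CPU; the 3-cycle miss latency and the thread contents are invented): the core runs one thread until it misses, then switches to the next ready thread while the miss is serviced.

```python
# Toy block-interleaving model with switch-on-cache-miss. Latencies and
# instruction streams are made up for illustration.

MISS_LATENCY = 3  # cycles until a missed load returns (assumed number)

def run(threads, total_cycles):
    """threads: dict thread_id -> list of instructions, where each
    instruction is 'op' or 'miss' (a load that misses the cache)."""
    blocked_until = {t: 0 for t in threads}  # cycle when thread is ready again
    pc = {t: 0 for t in threads}             # next instruction per thread
    trace = []
    current = None
    for cycle in range(total_cycles):
        # Switch if the current thread is blocked or finished.
        if (current is None or blocked_until[current] > cycle
                or pc[current] >= len(threads[current])):
            ready = [t for t in threads
                     if blocked_until[t] <= cycle and pc[t] < len(threads[t])]
            current = ready[0] if ready else None
        if current is None:
            trace.append("stall")
            continue
        instr = threads[current][pc[current]]
        pc[current] += 1
        trace.append(f"T{current}:{instr}")
        if instr == "miss":
            blocked_until[current] = cycle + 1 + MISS_LATENCY
            current = None  # force a switch next cycle
    return trace

trace = run({1: ["miss", "op"], 2: ["op", "op", "op"]}, 7)
print(trace)
```

Here thread 2's instructions fill the cycles that thread 1 spends waiting on its miss, which is exactly the latency-hiding effect the question is about.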

I have only found pictures which might suggest that SMT uses some kind of fine-grained multithreading, but I couldn't find any information that really confirms this.

I would be really thankful if someone could help me to understand how this part of SMT works. This detail is driving me crazy!


1 Answer

Answered by janneb:

The term SMT usually refers to out-of-order (OoO) processors. OoO processors already have all the machinery for handling dependencies between instructions, a physical register file that is a lot larger than the architectural register file, and so forth. In such a processor, adding SMT is relatively simple: essentially, the processor just needs support for the extra per-thread architectural state, and then to tag each instruction with the HW thread it belongs to; after that, the OoO execution machinery handles all the queued instructions just like before. So the OoO machinery handles dependencies between instructions as usual, and handles instructions which are waiting for e.g. a cache miss, etc. Instructions from different threads (or the same thread) are free to execute on whichever execution pipelines are free and able to execute them, regardless of which thread they belong to, subject to all their dependencies having been satisfied. And yes, instructions from multiple threads can execute concurrently on a superscalar core.
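The thread-tagging idea can be sketched as follows - a toy model of one scheduling cycle, where a single shared scheduler picks any ready micro-op regardless of its thread tag (all names and the 4-wide width are assumptions, not any real design):

```python
# Toy sketch: SMT as "tag each uop with its HW thread, then let one shared
# OoO scheduler pick whatever is ready". Names and widths are made up.

from dataclasses import dataclass, field

@dataclass
class Uop:
    thread: int                          # hardware-thread tag
    dest: str                            # physical register written
    srcs: list = field(default_factory=list)  # physical registers read

WIDTH = 4

def select(waiting, done_regs):
    """One scheduling cycle: issue up to WIDTH uops whose sources are
    ready, ignoring which thread each uop belongs to."""
    picked = []
    for u in waiting:
        if len(picked) == WIDTH:
            break
        if all(s in done_regs for s in u.srcs):
            picked.append(u)
    return picked

waiting = [
    Uop(thread=0, dest="p1"),                # ready, thread 0
    Uop(thread=0, dest="p2", srcs=["p1"]),   # waits on p1 (e.g. a miss)
    Uop(thread=1, dest="p3"),                # ready, thread 1
    Uop(thread=1, dest="p4"),                # ready, thread 1
]

issued = select(waiting, done_regs=set())
print([(u.thread, u.dest) for u in issued])  # mixes threads 0 and 1
```

Note that thread 0's stalled uop simply stays in the queue; the scheduler fills the slot with thread 1's work instead, which is why a cache miss in one thread doesn't idle the whole core.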

Interleaved multithreading, OTOH, is what you'll find in in-order processors. These processors lack all the sophisticated (and power hungry!) OoO logic, so they must do something simpler, unless they want to morph into OoO processors with all the costs that entails. Thus they choose a simpler form of multithreading, where at each point in time the processor only executes instructions from a single thread. So even if the processor is superscalar, at any point in time it can only execute instructions belonging to a single thread. After some time (in some processors as short as every cycle) the processor switches to another thread. And if a thread is blocked, e.g. waiting for a cache miss to be resolved, the processor bypasses that thread and runs the other threads. From the OS perspective all the threads appear to run simultaneously, since this switching happens at a much finer granularity than OS scheduling.
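The per-cycle variant described above can be sketched like this - a toy round-robin that hands each cycle to the next thread, skipping blocked ones (the thread count and the "blocked for the whole run" simplification are assumptions for illustration):

```python
# Toy fine-grained (per-cycle) interleaving: rotate threads every cycle,
# bypassing blocked ones. A real core tracks blocking per miss; here a
# thread is simply blocked for the whole run to keep the sketch short.

def fine_grained(n_threads, blocked, cycles):
    """Return which thread owns each cycle under round-robin switching."""
    order = []
    t = 0
    for _ in range(cycles):
        # Advance past blocked threads (at most one full rotation).
        for _ in range(n_threads):
            if t not in blocked:
                break
            t = (t + 1) % n_threads
        order.append(t)
        t = (t + 1) % n_threads
    return order

print(fine_grained(4, blocked={2}, cycles=8))
# thread 2 never gets a slot; the other threads rotate every cycle
```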