Since, as far as I know, cycle timings for the Cortex-M7 are not published, I've decided to try to measure cycle counts myself using the DWT counter on an STM32H750-DK; as a first example, I'm measuring a simple delay loop.
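For reference, here is roughly how I set up and read the counter (a minimal sketch using CMSIS register names; on the Cortex-M7 the DWT may need unlocking first):

// Enable trace and start the DWT cycle counter (CMSIS names, core_cm7.h).
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable the DWT/ITM blocks
DWT->LAR = 0xC5ACCE55;                          // unlock the DWT (Cortex-M7)
DWT->CYCCNT = 0;                                // reset the counter
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // enable cycle counting

uint32_t start = DWT->CYCCNT;
// ... code under cycle measurement ...
uint32_t cycles = DWT->CYCCNT - start;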
It seems that the Cortex-M7 can execute two instructions per cycle. I would understand this if the instructions were assembled into 16-bit encodings, but the results are the same even if I use registers R8 and above, which forces 32-bit encodings.
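To illustrate what I mean by the encodings (GNU assembler syntax; the .W suffix, like a high register, forces the 32-bit form):

// 16-bit Thumb encodings (low registers):
tloop:  subs   r5, r5, #1
        bne    tloop
// 32-bit Thumb-2 encodings (R8 and above force the wide form):
tloop2: subs.w r8, r8, #1
        bne.w  tloop2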
Is branch prediction really the main player here? On the first run I get more cycles, but on subsequent repetitions I see a constant overhead of 6 cycles regardless of N.
Is there any more information somewhere about the Cortex-M7 pipeline that would help explain the results I got? I'm not even sure the results make sense. Am I interpreting them correctly?
//-------------- not measured --------------------------
// ldr r5,=N
// ------------- code under cycle measurement ------
// tloop: subs r5,r5,#1
// bne tloop
// ------------- end of code ----------------------
/*
// Timings - taken from the second and later repetitions;
// first-run counts, where higher, are shown in brackets.
╔═══════╤════════════════╗
║ N │ DWT_CYCCNT(1st)║
╠═══════╪════════════════╣
║ 50 │ 56 (78) ║
╟───────┼────────────────╢
║ 100 │ 106 (128) ║
╟───────┼────────────────╢
║ 200 │ 206 ║
╟───────┼────────────────╢
║ 500 │ 506 ║
╟───────┼────────────────╢
║ 1000 │ 1006 ║
╟───────┼────────────────╢
║ 64000 │ 64006 (64028) ║
╚═══════╧════════════════╝
Comment: with R5 the instructions assemble to 16-bit encodings and with R8
to 32-bit encodings, but both give the same timing.
If a NOP is added inside the loop, for N=64000 the results are 96030 (first
run) and 96006 afterwards.
Conclusion: it seems that branch prediction is the main influence here.
*/
You are on an STM32, so there is a flash cache and prefetcher; if you are running from flash, that will affect your results.
That particular chip also requires flash wait states depending on clock frequency and voltage, further affecting your fetch rate.
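For example, on the H750 you can check the configured wait states and, while measuring, take the CPU caches out of the picture (a sketch assuming the CMSIS/STM32H7 device headers; LATENCY is described in reference manual RM0433):

// Read how many flash wait states are currently configured.
uint32_t ws = (FLASH->ACR & FLASH_ACR_LATENCY) >> FLASH_ACR_LATENCY_Pos;

// Remove the Cortex-M7 I/D caches from the measurement (CMSIS helpers).
SCB_DisableICache();
SCB_DisableDCache();
// ... run the timed loop here ...
SCB_EnableICache();
SCB_EnableDCache();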
The Cortex-M7 has a good-sized fetch line, and how small loops are aligned relative to it can have a dramatic effect on overall performance (from tens of percent up to doubling the execution time for the same machine code).
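One way to see this is to keep the loop identical but skew its starting offset and re-measure (a sketch in GCC inline assembly; the padding count is arbitrary):

// Place the loop at a known boundary, then shift it with NOP padding.
uint32_t n = 64000;
__asm volatile(
    ".balign 16         \n"   // start from a 16-byte boundary
    "nop                \n"   // add/remove NOPs to change the loop's offset
    "1: subs %0, %0, #1 \n"
    "   bne  1b         \n"
    : "+l"(n)                 // 'l' = low register, keeps 16-bit encodings
    :
    : "cc"                    // SUBS updates the flags
);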
The Cortex-M7 also has a branch predictor (I am not sure ARM uses that term, but it is there), and if I remember right it is enabled by default.
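If you want to experiment with it off, the Cortex-M7 TRM documents ACTLR bits for the branch target address cache (a sketch assuming CMSIS's SCnSCB->ACTLR; verify the bit positions against the TRM for your core revision):

// Cortex-M7 TRM: DISBTACREAD (bit 13) and DISBTACALLOC (bit 12) disable the
// branch target address cache; DISFOLD (bit 2) disables dual-issue.
SCnSCB->ACTLR |= (1UL << 13) | (1UL << 12);  // stop reading/filling the BTAC
SCnSCB->ACTLR |= (1UL << 2);                 // optionally force single-issue
__DSB();
__ISB();                                     // ensure the change takes effect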
This is not a PIC. We do not look at instructions and count clocks; we write applications and then profile them if needed. Particularly on architectures/cores like these, adding or removing a single line of high-level language code can cause double-digit percent performance changes in either direction.

Folks have argued with me that these cores are in fact predictable, and they are, in the sense that the same code sequence, absent other non-deterministic effects, will run in the same number of clocks; I have demonstrated that many times. But add a NOP to change the alignment of that code, and the number of clocks for that code can change, sometimes by a dramatic amount, settling on a different but again consistent count. These are pipelined processors (though not very deep on the Cortex-M0 and the like), and that means they are not predictable by inspecting instructions and counting cycles like in the good old days.
You also have systemic effects. ARM makes processor cores, IP, not chips. The chip vendor plays a huge role in execution performance (the same goes for x86; we have not been processor-bound for a long time): how the buses are handled, the flash and SRAM IP they buy, arbitration, and so on. So, as stated above, ST does things differently from TI and NXP in their Cortex-M products, and all of them will have flash performance side effects; even at zero wait states, the flash typically runs at half the processor clock speed. With the flash side effects disabled (you have to use a TI or maybe NXP part; you cannot do this on ST), at zero wait states, the same machine code at the same alignment runs at half the performance of SRAM, at least on a number of products I have seen. With ST you can play some games to flush the cache and take a single run at the code.
If your goal is to see whether the Cortex-M7 is superscalar, fill the SRAM with hundreds or thousands of instructions, then loop over that: one big massive loop that is 99.99...% the instruction under test. Turn off branch prediction and any caching (at that point the few clocks of branch overhead should be in the wash) and see what you see. I read the databook and datasheet for this question, but I did not go back and check what the SRAM performance is. High-performance cores like ARM's are going to be sensitive to the system: fetches, loads and stores. MCUs make it worse with clock domains, and peripherals are a whole other matter (sampling a GPIO pin in a loop is not as fast as most people expect).
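A sketch of that experiment (GNU toolchain assumed; .ram_code is a placeholder section name that your linker script must map to SRAM or ITCM):

// A RAM-resident block that is almost entirely the instruction under test.
__attribute__((section(".ram_code"), noinline))
static void test_block(void)
{
    __asm volatile(
        ".rept 4000        \n"   // 4000 copies of the instruction under test
        "add  r0, r0, #1   \n"
        ".endr             \n"
        ::: "r0"
    );
}

static uint32_t measure(void)
{
    uint32_t t0 = DWT->CYCCNT;
    test_block();
    return DWT->CYCCNT - t0;     // cycles per instruction ~= result / 4000
}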
The compilers do not know the system either. They will do a PC-relative load to pull a difficult constant (0x12345678) into a register instead of using the Thumb-2 MOVW/MOVT pair to build it half at a time: that pair is 64 bits of instructions, but they are fetched linearly, rather than stopping to do a separate data load cycle from slow flash, which costs more clocks. Programmers trying to count clocks to increase performance, if that is your ultimate goal here, often do not realize this either.
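For example, the two ways to get 0x12345678 into a register (standard Thumb-2 instructions; which one you get depends on the compiler and options):

// Literal-pool load: a short LDR plus a separate data fetch from memory.
ldr  r0, =0x12345678   // assembler puts the constant in a nearby literal pool

// MOVW/MOVT pair: 64 bits of instructions, fetched linearly, no data access.
movw r0, #0x5678       // write the low half, zero the upper half
movt r0, #0x1234       // write the high half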
The bottom line is that you are not processor-bound. You cannot reason from the pipeline and the instruction sequence alone unless you are running the core in simulation with a perfect simulated memory, where the read data bus responds to the read address bus on the first available clock cycle. Even then, with this core, you would still see branch prediction and fetch-line alignment effects. On a real MCU you always have flash issues, sometimes SRAM issues, and sometimes general chip glue/implementation issues as well.