There are lots of NOR QSPI flash chips that support XIP (eXecute In Place). In this mode the embedded CPU (or MCU) can directly execute code stored in the flash. But as we know, a QSPI flash can only output 4 bits of data per clock cycle, while many MCUs, such as the ARM Cortex-M series, need a 32-bit instruction per cycle. So the MCU has to wait at least 8 cycles to get a valid instruction, which seems very slow. Besides, the maximum frequency of a NOR QSPI flash chip is often below 150 MHz, while the STM32F407 runs at 168 MHz, which means an even longer delay before the CPU receives a valid instruction.
I don't know if my understanding is wrong, but I really couldn't find many details about XIP. The Technical Reference Manuals of the STM32Fxxx parts only say that they have embedded flash and support XIP, but they don't show any details. Besides, I guess the MCU also needs a very complicated QSPI controller to support XIP.
Can anyone give me some guidance on this question?
As far as I know, the MCU reads instructions from the external flash into a buffer in RAM and then executes them from there. It reads them in chunks. The size of one chunk depends very much on each vendor's implementation (i.e. how much RAM is available; how the flash is connected: SPI, Dual SPI, Quad SPI or Octal SPI; whether Direct Memory Access (DMA) is possible; whether the flash supports Continuous Read Mode). If the chunk is small, the core stalls more often while waiting for instructions. If the chunk is large, it uses up RAM, and on a branch to code outside the buffer the chunk that was already loaded is thrown away and a new one has to be loaded for the new code.
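To make the chunk-size trade-off a bit more concrete, here is a toy model in Python. It is only a sketch under my own assumptions: a single buffer that holds exactly one aligned chunk, 4-byte instructions, and made-up address traces. The function name `count_refills` and the traces are hypothetical and do not come from any vendor document.

```python
def count_refills(trace, chunk_bytes):
    """Count how often a single one-chunk buffer has to be reloaded.

    trace       -- sequence of instruction addresses, in execution order
    chunk_bytes -- size of one aligned chunk copied from flash to RAM
    """
    loaded = None   # index of the chunk currently held in the buffer
    refills = 0
    for addr in trace:
        chunk = addr // chunk_bytes
        if chunk != loaded:      # instruction is not in the buffer: reload
            loaded = chunk
            refills += 1
    return refills


# Straight-line code: 64 consecutive 4-byte instructions starting at 0.
straight = list(range(0, 256, 4))

# Branch-heavy code: ping-pong between two routines 1 KiB apart.
branchy = [0, 4, 1024, 1028] * 16

for chunk_bytes in (16, 64):
    for name, trace in (("straight", straight), ("branchy", branchy)):
        refills = count_refills(trace, chunk_bytes)
        print(name, chunk_bytes, refills, refills * chunk_bytes)
```

In this model the straight-line trace needs 16 refills with 16-byte chunks and only 4 with 64-byte chunks, so a larger chunk means fewer stalls. The branch-heavy trace needs 32 refills in both cases, so the larger chunk only moves four times as many bytes from flash without saving a single reload.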
So let's say the flash is connected with Dual SPI and DMA is possible. Then for XiP the controller would start by executing some bootloader code (normally from some internal ROM). The bootloader sets up the QSPI flash controller and the core's DMA to copy instructions from the external flash to the RAM buffer. Then it would start executing the code in that buffer. From then on the DMA copies instructions to RAM asynchronously, which means the MCU core itself wastes almost no time copying code.
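A rough way to see what the asynchronous copy buys you is to compare total run time with and without overlap. This is a back-of-the-envelope sketch, not a description of any particular MCU: it assumes straight-line code, equally sized chunks, and fixed cycle counts per chunk. The function `run_time` and its parameters are hypothetical names I made up for illustration.

```python
def run_time(chunks, exec_cycles, fetch_cycles, prefetch):
    """Total cycles to run `chunks` chunks of straight-line code.

    exec_cycles  -- cycles the core needs to execute one chunk (assumed)
    fetch_cycles -- cycles needed to copy one chunk from flash (assumed)
    prefetch     -- True if the DMA loads the next chunk during execution
    """
    if chunks == 0:
        return 0
    if not prefetch:
        # The core waits for every chunk, then executes it.
        return chunks * (fetch_cycles + exec_cycles)
    # The first chunk must arrive before anything can run. After that,
    # each fetch overlaps with the execution of the previous chunk, so
    # every step costs whichever of the two takes longer.
    return fetch_cycles + (chunks - 1) * max(exec_cycles, fetch_cycles) + exec_cycles


print(run_time(4, exec_cycles=10, fetch_cycles=6, prefetch=False))  # 64
print(run_time(4, exec_cycles=10, fetch_cycles=6, prefetch=True))   # 46
```

As long as fetching a chunk takes no longer than executing it, only the very first fetch is visible to the core. Once the flash is the slower side, the core is limited by the flash again, and a taken branch would break the overlap entirely, which this model deliberately ignores.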
You said that you could not find many details about XiP. The best sources of information for me were the application notes of the various manufacturers. The implementations differ but have a lot in common.
Here are 3 example documents:
Microchip AN44065 gives an overview of XiP: http://ww1.microchip.com/downloads/en/AppNotes/Atmel-44065-Execute-in-Place-XIP-with-Quad-SPI-Interface-SAM-V7-SAM-E7-SAM-S7_Application-Note.pdf
ST.com AN5188 page 15 has a performance comparison of instructions in RAM vs external Flash which might be of special interest: https://www.st.com/content/ccc/resource/technical/document/application_note/group0/d8/39/10/2f/ee/c9/4b/19/DM00514974/files/DM00514974.pdf/jcr:content/translations/en.DM00514974.pdf
ST.com AN4760 page 26 describes in detail the XiP architecture and how the speed improvements can be achieved; it's got some cool formulas too: https://www.st.com/content/ccc/resource/technical/document/application_note/group0/b0/7e/46/a8/5e/c1/48/01/DM00227538/files/DM00227538.pdf/jcr:content/translations/en.DM00227538.pdf