Confused about the XIP (eXecute In Place) function of QSPI FLASH


Many NOR QSPI flash chips support XIP (eXecute In Place). In this mode, the embedded CPU (or MCU) can directly execute code stored in the flash. But as far as I know, a QSPI flash can only output 4 bits of data per clock cycle, while many MCUs, such as the ARM Cortex-M series, need a 32-bit instruction per cycle. So the MCU has to wait at least 8 cycles to get a valid instruction, which seems very slow. Besides, the maximum frequency of a NOR QSPI flash chip is often below 150 MHz, while the STM32F407 runs at 168 MHz, which means an even longer delay before the CPU receives a valid instruction.

I don't know if my understanding is wrong, but I really couldn't find many details about XIP. The Technical Reference Manuals of the STM32Fxxx only say that the parts have embedded flash and support XIP, but they don't give any details. Besides, I guess a very complicated QSPI controller would also have to be implemented in the MCU to support XIP.

Can anyone give me some guidelines on this question?


There are 2 answers

user10607 On

As far as I know, the MCU uses a buffer in RAM: it reads instructions from the external flash into that buffer and then executes them from there. It reads them in chunks. The size of one chunk depends very much on each vendor's implementation (i.e., how much RAM is available; how the flash is connected: SPI, Dual SPI, Quad SPI, or Octal SPI; whether Direct Memory Access (DMA) is possible; and whether the flash supports Continuous Read Mode). If the chunk is small, the core stalls often while waiting for instructions. If the chunk is large, it uses up RAM, and on a branch the chunks already loaded into RAM may have to be thrown away and reloaded with the new code.

So let's say the flash is connected with Dual SPI and DMA is possible. Then for XiP the controller would start by executing some bootloader code (normally from some internal ROM). The bootloader sets up the QSPI flash controller and the DMA to copy instructions from the external flash to the RAM buffer. The core then starts executing the code in that buffer while the DMA asynchronously copies further instructions to RAM. This means the MCU core itself wastes almost no time copying code.
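To make the chunk-size trade-off above a bit more concrete, here is a minimal sketch in Python. It is only a toy model, not how any particular vendor does it: it assumes a single aligned chunk held in RAM, and the function name `count_chunk_loads` and both address traces are made up for illustration.

```python
def count_chunk_loads(fetch_addresses, chunk_size):
    """Count how often a single-chunk RAM buffer must be refilled from flash.

    The buffer holds one aligned chunk of `chunk_size` bytes. A fetch that
    falls outside the currently buffered chunk triggers a reload.
    """
    loads = 0
    current_chunk = None  # index of the chunk currently held in RAM
    for addr in fetch_addresses:
        chunk = addr // chunk_size
        if chunk != current_chunk:
            loads += 1
            current_chunk = chunk
    return loads


# Hypothetical trace 1: straight-line code, 64 sequential 4-byte fetches.
straight_line = list(range(0, 256, 4))
# Hypothetical trace 2: execution bouncing between two distant functions.
ping_pong = [0x000, 0x400] * 8

for size in (16, 64, 256):
    print(size,
          count_chunk_loads(straight_line, size),
          count_chunk_loads(ping_pong, size))
```

With these traces the straight-line code needs 16, 4 and 1 loads as the chunk grows from 16 to 64 to 256 bytes, while the ping-pong trace needs 16 loads at every size. So a bigger chunk helps sequential code but only moves more unused bytes when the code branches far away.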

You said that you could not find many details about XiP. The best source of information for me has been the application notes of various manufacturers. The implementations differ but have a lot in common.

Here are 3 example documents:

vjalle On

XIP is a feature of the QSPI controller in the MCU, not a feature of the flash device itself. QSPI can be fast enough to be memory-mapped. That is, there is a dedicated memory area, and when it is accessed, the QSPI controller automatically issues the proper commands and fetches the data. The core has to wait for the access, which usually takes much longer than an access to a parallel memory. Of course, that depends on the core clock and the QSPI configuration.
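As a rough back-of-the-envelope model of that wait, the sketch below counts QSPI clocks for one read and converts them into core clock cycles. The function names, the 84 MHz QSPI clock and the 20-clock overhead are assumptions picked for illustration; the real command, address and dummy cycle counts depend on the flash part and on how the controller is configured.

```python
def qspi_read_clocks(n_bytes, data_lines=4, overhead_clocks=0):
    """QSPI clock cycles for one read: fixed overhead plus the data phase.

    overhead_clocks stands for the command, address and dummy cycles; the
    real value depends on the flash part and the controller configuration.
    Single data rate is assumed (one bit per line per clock).
    """
    data_bits = n_bytes * 8
    data_clocks = -(-data_bits // data_lines)  # ceiling division
    return overhead_clocks + data_clocks


def core_wait_cycles(n_bytes, core_hz, qspi_hz, data_lines=4, overhead_clocks=0):
    """Core clock cycles that elapse while the read is in flight."""
    clocks = qspi_read_clocks(n_bytes, data_lines, overhead_clocks)
    return clocks * core_hz / qspi_hz


CORE_HZ = 168_000_000  # core clock from the question
QSPI_HZ = 84_000_000   # assumed: core clock divided by 2

# One 32-bit word, data phase only.
print(core_wait_cycles(4, CORE_HZ, QSPI_HZ))
# One 32-bit word including an assumed 20-clock command/address/dummy overhead.
print(core_wait_cycles(4, CORE_HZ, QSPI_HZ, overhead_clocks=20))
# A 32-byte burst with the same assumed overhead.
print(core_wait_cycles(32, CORE_HZ, QSPI_HZ, overhead_clocks=20))
```

Under these assumptions a single word costs 16 core cycles for the data phase alone and 56 with the overhead, while a 32-byte burst costs 168 cycles, i.e. 21 per word. That is why controllers that fetch longer bursts sequentially tend to look much better than the "8 cycles per instruction" estimate suggests for straight-line code, and worse for scattered accesses.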

In some devices both the data and instruction buses can be connected to QSPI, while in others only the data bus is connected. The latter devices support memory-mapped operation but not XIP. Some devices can only do memory-mapped reads, while others can write, too. Some devices feature dedicated cache/buffer memory inside the QSPI controller and prefetch data for improved performance, while others directly translate AHB accesses without "thinking" much. There are many different implementations with varying performance.

For the flash device XIP is just a read operation. No special support needed.
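To illustrate "just a read operation", here is a small sketch of what a memory-mapped controller conceptually does with a CPU address: subtract the window base and send a read opcode followed by the address. The helper names, the window base `0x9000_0000` and the 16 MiB window size are assumptions for illustration. `0x03` is the basic read opcode on many SPI NOR parts, but check the datasheet of the actual device; quad read commands typically add mode and dummy cycles that are not modelled here.

```python
READ_DATA = 0x03  # basic read opcode on many SPI NOR parts (check the datasheet)

WINDOW_BASE = 0x9000_0000  # assumed start of the memory-mapped window
WINDOW_SIZE = 0x0100_0000  # assumed 16 MiB window (24-bit addressing)


def flash_offset(cpu_address, window_base=WINDOW_BASE, window_size=WINDOW_SIZE):
    """Translate a CPU address inside the window into a flash offset."""
    if not window_base <= cpu_address < window_base + window_size:
        raise ValueError("address is outside the memory-mapped window")
    return cpu_address - window_base


def read_frame(offset, opcode=READ_DATA):
    """Build the bytes sent to the flash: opcode plus 24-bit big-endian address."""
    if not 0 <= offset < 1 << 24:
        raise ValueError("offset does not fit in a 24-bit address")
    return bytes([opcode]) + offset.to_bytes(3, "big")


frame = read_frame(flash_offset(0x9000_1234))
print(frame.hex(" "))  # 03 00 12 34
```

After this frame the flash simply streams data bytes for as long as it is clocked. Whether the requester is an instruction fetch or a data load makes no difference to the flash.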