ARM: Disabling MMU and updating PC


In short, I would like to shut down all MMU (and cache) operations in a Linux context (from inside the kernel), for debugging purposes, just to run some tests. To be perfectly clear, I don't expect my system to remain functional afterwards.

About my setup: I'm currently fiddling with a Freescale Vybrid (VF610), which integrates a Cortex-A5, and its low-power modes. Since I'm experiencing some suspiciously local memory corruption while the chip is in "Low Power Stop" mode with my DDR3 in self-refresh, I'm trying to shift the operations bit by bit; right now I perform all the suspend/resume steps without actually executing the WFI. Since I run with address translation before this instruction and without it afterwards (it's essentially a reset), I would like to "simulate" that by "manually" shutting down the MMU.

(I currently have no JTAG nor any other debug access to my chip. I load it via MMC/TFTP/NFS, and debug it with LEDs.)

What I've tried so far:

    /* disable the Icache, Dcache and branch prediction */
    mrc     p15, 0, r6, c1, c0, 0
    ldr r7, =0x1804
    bic r6, r6, r7
    mcr     p15, 0, r6, c1, c0, 0
    isb

    /* disable the MMU and TEX remap */
    ldr r7, =0x10000001
    bic r6, r6, r7
    isb
    mcr p15, 0, r6, c1, c0, 0   @ turn off the MMU
    mrc p15, 0, r6, c0, c0, 0   @ read id reg
    isb
    dsb
    dmb

and other variations to the same effect.

What I observe:

Before the MMU block, I can light a LED (3 assembly instructions, no branch, nothing fancy, nor any access to my DDR, which is already in self refresh - the virtual address for the GPIO port is stored in a register before that).

After the MMU block, I no longer can, whether I try with physical or virtual addresses.

I think the problem may be related to my PC (program counter), which still holds a stale virtual address. Here is how things are done elsewhere in the kernel, but the other way round (that is, while enabling translation):

    ldr r3, =cpu_resume_after_mmu

    instr_sync
    mcr p15, 0, r0, c1, c0, 0   @ turn on MMU, I-cache, etc
    mrc p15, 0, r0, c0, c0, 0   @ read id reg
    instr_sync

    mov r0, r0
    mov r0, r0
    ret r3          @ jump to virtual address
ENDPROC(cpu_resume_mmu)
    .popsection
cpu_resume_after_mmu:

(from arch/arm/kernel/sleep.S, cpu_resume_mmu)

I wonder what this two-instruction delay relates to, and where it is documented. I've found nothing on the subject. I've tried something equivalent, without success:

    adr lr, BSYM(phys_block)

    /* disable the Icache, Dcache and branch prediction */
    mrc     p15, 0, r6, c1, c0, 0
    ldr r7, =0x1804
    bic r6, r6, r7
    mcr     p15, 0, r6, c1, c0, 0
    isb

    /* disable the MMU and TEX remap */
    ldr r7, =0x10000001
    bic r6, r6, r7
    isb
    mcr p15, 0, r6, c1, c0, 0   @ turn off the MMU
    mrc p15, 0, r6, c0, c0, 0   @ read id reg
    isb
    dsb
    dmb

    mov r0, r0
    mov r0, r0
    ret lr

phys_block:
    blue_light
    loop

Thanks to anyone who has a clue or some pointers!

There are 2 answers.

Aurélien Martin (accepted answer):

Since both Jacen and dwelch kindly provided the answer I needed in a comment (one each), I will answer my own question here for the sake of clarity:

The trick was simply to add an identity mapping for the page performing the transition, allowing me to jump to it with a "physical" (though actually still virtual) PC, then disable the MMU.

Here is the final code (a bit specific, but commented):

    /* Duplicate mapping to here */

    mrc p15, 0, r4, c2, c0, 0 // Read TTBR0
    ldr r10, =0x00003fff
    bic r4, r10 // Extract page table physical base address
    orr r4, #0xc0000000 // Nastily "translate" it to the virtual one

    /*
     * Here r8 holds vf_suspend's physical address. I had no way of
     * doing this more "locally", since both physical and virtual
     * space for my code are runtime-allocated.
     */

    add lr, r8, #(phys_block-vf_suspend) // -> phys_block physical address 

    lsr r9, lr, #20 // SECTION_SHIFT     -> Page index
    add r7, r4, r9, lsl #2 // PMD_ORDER  -> Entry address
    ldr r10, =0x00000c0e // Flags
    orr r9, r10, r9, lsl #20 // SECTION_SHIFT   -> Entry value
    str r9, [r7] // Write entry

    ret lr  // Jump / transition to virtual addressing

phys_block:
    /* disable the MMU and TEX */
    isb
    mrc     p15, 0, r6, c1, c0, 0
    ldr r7, =0x10000001
    bic r6, r6, r7
    mcr p15, 0, r6, c1, c0, 0   @ turn off the MMU and TEX remap
    mrc p15, 0, r6, c0, c0, 0   @ read id reg
    isb
    dsb
    dmb

    /* disable the Icache, Dcache and branch prediction */
    mrc     p15, 0, r6, c1, c0, 0
    ldr r7, =0x1804
    bic r6, r6, r7
    mcr     p15, 0, r6, c1, c0, 0
    isb

    // Done !
Notlikethat:

To address the "what is this 2-instruction delay" part of the question: as with much of arch/arm, it's mostly just leftover legacy guff*.

Back in the days long before any kind of barrier instructions, you had to account for the fact that at the point you switch the MMU, the pipeline contains instructions already fetched and decoded before the switch, so having anything like a branch or memory access in there will go horribly wrong if the address space has changed by the time it executes. The ARMv4 Architecture Reference Manual makes the wonderful statement "The correct code sequence for enabling and disabling the MMU is IMPLEMENTATION DEFINED" - in practice what that mostly meant was that you knew your pipeline was 3 stages long so stuck two NOPs in to fill it safely. Or took full advantage of the fact to do horrible things like arrange a jump straight to a translated VA without going via an identity mapping (yikes!).

From an entertaining trawl of old microarchitecture manuals, 3 NOPs are needed for StrongARM (compared to 2 for the 3-stage ARM7 pipeline), and reading CP15 with a data dependency on the result is the recommended self-synchronising sequence for XScale, which explains the apparently pointless read of the main ID register.

On something modern (ARMv6 or later), however, none of this should be needed as you have architected barriers, so you just flip the switch then issue an isb to flush the pipeline, which is what the instr_sync macro expands to when building for such architectures.

* or a fine example of the Linux "works on everything" approach, depending on your point of view...