I have a situation where some of the address space is sensitive: if you read it, you crash, because there is nobody there to respond to that address.
pop {r3,pc}
bx r0
0: e8bd8008 pop {r3, pc}
4: e12fff10 bx r0
8: bd08 pop {r3, pc}
a: 4700 bx r0
The bx was not created by the compiler as an instruction; it is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a pc-relative load is set up. This is basically the literal pool, and it happens to have bits that resemble a bx.
It is easy to write a test program that generates the issue.
unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
return(more_fun(0x12344700)+1);
}
00000000 <fun>:
0: b510 push {r4, lr}
2: 4802 ldr r0, [pc, #8] ; (c <fun+0xc>)
4: f7ff fffe bl 0 <more_fun>
8: 3001 adds r0, #1
a: bd10 pop {r4, pc}
c: 12344700 eorsne r4, r4, #0, 14
What appears to be happening is that the processor, while waiting on data coming back from the pop (ldm), moves on to the next instruction, bx r0 in this case, and starts a prefetch at the address in r0, which hangs the ARM.
As humans we see the pop as an unconditional branch, but the processor does not; it keeps going through the pipe.
Prefetching and branch prediction are nothing new (we have the branch predictor off in this case); they are decades old and not limited to ARM. But few instruction sets have the PC as a GPR and instructions that to some extent treat it as non-special.
I am looking for a gcc command line option to prevent this. I can't imagine we are the first ones to see this.
I can of course do this
-march=armv4t
00000000 <fun>:
0: b510 push {r4, lr}
2: 4803 ldr r0, [pc, #12] ; (10 <fun+0x10>)
4: f7ff fffe bl 0 <more_fun>
8: 3001 adds r0, #1
a: bc10 pop {r4}
c: bc02 pop {r1}
e: 4708 bx r1
10: 12344700 eorsne r4, r4, #0, 14
which prevents the problem.
Note that this is not limited to thumb mode; gcc can produce arm code like this as well, with the literal pool right after the pop.
unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
return(more_fun(0xe12fff10)+1);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e59f0008 ldr r0, [pc, #8] ; 14 <fun+0x14>
8: ebfffffe bl 0 <more_fun>
c: e2800001 add r0, r0, #1
10: e8bd8010 pop {r4, pc}
14: e12fff10 bx r0
I am hoping someone knows a generic or ARM-specific option that either does an armv4t-like return (pop {r4,lr}; bx lr in arm mode, for example) without the baggage, or puts a branch-to-self immediately after a pop pc. The latter seems to solve the problem: the pipe is not confused about b being an unconditional branch.
EDIT
ldr pc,[something]
bx rn
also causes a prefetch, which is not going to fall under -march=armv4t. gcc intentionally generates ldrls pc,[]; b somewhere for switch statements and that is fine. I didn't inspect the backend to see if there are other ldr pc,[] instructions generated.
EDIT
Looks like ARM did report this as an erratum (erratum 720247, "Speculative Instruction fetches can be made anywhere in the memory map"), wish I had known that before we spent a month on it...
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html has a -mpure-code option, which doesn't put constants in code sections. "This option is only available when generating non-pic code for M-profile targets with the MOVT instruction." So it probably loads constants with a pair of mov-immediate instructions instead of from a constant pool.

This doesn't fully solve your problem though, since speculative execution of regular instructions (after a conditional branch inside a function) with bogus register contents could still trigger access to unpredictable addresses. Or just the first instruction of another function might be a load, so falling through into another function isn't always safe either.
I can try to shed some light on why this is obscure enough that compilers don't already avoid it.
Normally, speculative execution of instructions that fault is not a problem. The CPU doesn't actually take the fault until it becomes non-speculative. Incorrect (or non-existent) branch prediction can make the CPU do something slow before figuring out the right path, but there should never be a correctness problem.
Normally, speculative loads from memory are allowed in most CPU designs. But memory regions with MMIO registers obviously have to be protected from this. In x86 for example, memory regions can be WB (normal, write-back cacheable, speculative loads allowed), or UC (Uncacheable, no speculative loads). Not to mention write-combining and write-through...
You probably need something similar to solve your correctness problem, to stop speculative execution from doing something that will actually explode. This includes speculative instruction-fetch triggered by a speculative bx r0. (Sorry I don't know ARM, so I can't suggest how you'd do that. But this is why it's only a minor performance problem for most systems, even though they have MMIO registers that can't be speculatively read.)

I think it's very unusual to have a setup that lets the CPU do speculative loads from addresses that crash the system instead of just raising an exception when / if they become non-speculative.
This may be why you're always seeing speculative execution beyond an unconditional branch (the pop), instead of just very rarely.

Nice detective work with using a bx to return, showing that your CPU detects that kind of unconditional branch at decode, but doesn't check the pc bit in a pop. :/

In general, branch prediction has to happen before decode, to avoid fetch bubbles. Given the address of a fetch block, predict the next block-fetch address. Predictions are also generated at the instruction level instead of fetch-block level, for use by later stages of the core (because there can be multiple branch instructions in a block, and you need to know which one is taken).
That's the generic theory. Branch prediction isn't 100%, so you can't count on it to solve your correctness problem.
x86 CPUs can have performance problems where the default prediction for an indirect jmp [mem] or jmp reg is the next instruction. If speculative execution starts something that's slow to cancel (like div on some CPUs) or triggers a slow speculative memory access or TLB miss, it can delay execution of the correct path once it's determined.

So it's recommended (by optimization manuals) to put ud2 (illegal instruction) or int3 (debug trap) or similar after a jmp reg. Or better, put one of the jump-table destinations there so "fall-through" is a correct prediction some of the time. (If the BTB doesn't have a prediction, next-instruction is about the only sane thing it can do.)

x86 doesn't normally mix code with data, though, so this is more likely to be a problem for architectures where literal pools are common. (But loads from bogus addresses can still happen speculatively after indirect branches, or mispredicted normal branches.
e.g. if(address_good) { call table[address](); } could easily mispredict and trigger speculative code-fetch from a bad address. But if the eventual physical address range is marked uncacheable, the load request would stop in the memory controller until it was known to be non-speculative.)

A return instruction is a type of indirect branch, but it's less likely that a next-instruction prediction is useful. So maybe bx lr stalls because speculative fall-through is less likely to be useful?

pop {pc} (aka LDMIA from the stack pointer) is either not detected as a branch in the decode stage (if it doesn't specifically check the pc bit), or it's treated as a generic indirect branch. There are certainly other use-cases for ld into pc as a non-return branch, so detecting it as a probable return would require checking the source register encoding as well as the pc bit.

Maybe there's a special (internal hidden) return-address predictor stack that helps get bx lr predicted correctly every time, when paired with bl? x86 does this, to predict call / ret instructions.

Have you tested if pop {r4, pc} is more efficient than pop {r4, lr} / bx lr? If bx lr is handled specially in more than just avoiding speculative execution of garbage, it might be better to get gcc to do that, instead of having it lead its literal pool with a b instruction or something.