I wrote a very simple memset in C that works fine up to -O2 but not with -O3...
memset:
void *memset(void *blk, int c, size_t n)
{
    unsigned char *dst = blk;

    while (n-- > 0)
        *dst++ = (unsigned char)c;

    return blk;
}
...which compiles to this assembly when using -O2:
20000430 <memset>:
20000430: e3520000 cmp r2, #0 @ compare param 'n' with zero
20000434: 012fff1e bxeq lr @ if equal return to caller
20000438: e6ef1071 uxtb r1, r1 @ else zero extend (extract byte from) param 'c'
2000043c: e0802002 add r2, r0, r2 @ add pointer 'blk' to 'n'
20000440: e1a03000 mov r3, r0 @ move pointer 'blk' to r3
20000444: e4c31001 strb r1, [r3], #1 @ store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 @ compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> @ if not equal store next byte
20000450: e12fff1e bx lr @ else back to caller
This makes sense to me. I annotated what happens here.
When I compile it with -O3, the program crashes. My memset calls itself repeatedly until it has eaten the whole stack:
200005e4 <memset>:
200005e4: e3520000 cmp r2, #0 @ compare param 'n' with zero
200005e8: e92d4010 push {r4, lr} @ ? (1)
200005ec: e1a04000 mov r4, r0 @ move pointer 'blk' to r4 (temp to hold return value)
200005f0: 0a000001 beq 200005fc <memset+0x18> @ if equal (first line compare) jump to epilogue
200005f4: e6ef1071 uxtb r1, r1 @ zero extend (extract byte from) param 'c'
200005f8: ebfffff9 bl 200005e4 <memset> @ call myself ? (2)
200005fc: e1a00004 mov r0, r4 @ epilogue start. move return value to r0
20000600: e8bd8010 pop {r4, pc} @ restore r4 and back to caller
I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter if I try to set the memory to '0' or something else so the function is not only called on .bss (zero initialised) variables.
(1) This is a problem. This push gets endlessly repeated without a matching pop as it's called by (2) when the function doesn't early-exit because of 'n' being zero. I verified this with uart prints. Also r2 is never touched so why should the compare to zero ever become true?
Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?
Background: I'm using external code that requires memset in my bare-metal project, so I rolled my own. It's only used once on startup and is not performance critical.
/edit: The compiler is called with these options:
arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3
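Your first question (1): that is per the calling convention. If you are going to make a nested function call, you need to preserve the link register, and the stack needs to stay 64-bit aligned. The code uses r4 (to hold the return value across the call), so that is the extra register saved. No magic there.
Your second question (2): it is not meant to call your memset. The compiler is optimizing your code because it recognizes the loop as an inefficient memset and replaces the loop with a call to memset. Because your function is itself named memset, that call ends up back in your function, which is the endless recursion you see. Fuz has provided the answers to your question.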
Rename the function
and you can see this.
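A minimal sketch of that experiment, using the hypothetical name `fill_bytes` (the name is an assumption, not taken from the original post). The body is the loop from the question, unchanged. Built at -O3 with the options from the question, one would expect the disassembly of this function to contain a `bl` to `memset` instead of a branch to itself, which would suggest that the compiler replaced the loop (presumably through the pattern replacement GCC documents under -ftree-loop-distribute-patterns) rather than deliberately generating a recursive call.

```c
#include <stddef.h>

/* Hypothetical name: the same loop as the question's memset, only renamed,
 * so that a compiler-generated call to memset no longer targets this
 * function itself. */
void *fill_bytes(void *blk, int c, size_t n)
{
    unsigned char *dst = blk;

    /* Byte-by-byte fill; an optimizer may recognize this as a memset idiom. */
    while (n-- > 0)
        *dst++ = (unsigned char)c;

    return blk;
}
```

Note that in a build with -nostdlib the renamed version still needs some memset to link against, so this is mainly useful for inspecting the generated code.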
If you were to use -ffreestanding as Fuz recommended, then you see this or something like it
which looks like it simply inlined memset: the one the compiler knows (the faster one), not your code.
So if you want it to use your code, then stick with -O2. Yours is pretty inefficient, so I am not sure why you need to push the optimization any further than that.
It isn't going to get any better than that without replacing your code with something else.
Fuz already answered the question:
It is replacing your code with memset; if you do not want it to do that, use -ffreestanding.
If you wish to go beyond that and wonder why -fno-builtin-memset didn't work, that is a question for the gcc folks: file a ticket and let us know what they say (or just look at the compiler source code).
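For completeness, here is a hedged sketch of one more way to keep a byte-wise loop from being rewritten, which was not suggested in this thread: storing through a `volatile` pointer. The assumption is that the compiler must emit every volatile store individually and therefore should not fold the loop into a library call. The name `memset_bytewise` is hypothetical; in the bare-metal project the function would carry the name `memset`. This trades speed for predictability, which seems acceptable for a routine that runs once at startup.

```c
#include <stddef.h>

/* Hypothetical name; assumption: volatile stores are not merged or replaced
 * by a call to memset, so the loop stays a plain store loop at any -O level. */
void *memset_bytewise(void *blk, int c, size_t n)
{
    volatile unsigned char *dst = blk;

    /* One volatile byte store per iteration. */
    while (n-- > 0)
        *dst++ = (unsigned char)c;

    return blk;
}
```

Checking the disassembly for a `strb` in the loop and the absence of a `bl` is still the only way to confirm this for a given toolchain.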