I've been using the ARM GCC release aarch64-none-elf-gcc-11.2.1 for some time in a large bare-metal project that has successfully used libc functions (malloc/memcpy) many times without issue, with these options:
-L$AARCH64_GCC_PATH/aarch64-none-elf/lib -lc -lnosys -lg
I recently saw an exception due to an unaligned access during memcpy, despite compiling with -mstrict-align.
After isolating the issue and writing a unit test, I believe I've found a bug. Please ignore the addresses in the objdump and the memcpy call; I made them up for this test.
// unit test
#include <stdlib.h>
#include <string.h>

volatile int bssTest;

void swap(int a, int b) {
    (void)a; (void)b;  // unused in this reduced test
    memcpy((void *)0x500, (void *)0x1000, 0xc);  // 12-byte copy
}
0000000000060040 <memcpy>:
60040: f9800020 prfm pldl1keep, [x1]
60044: 8b020024 add x4, x1, x2
60048: 8b020005 add x5, x0, x2
6004c: f100405f cmp x2, #0x10
60050: 54000209 b.ls 60090 <memcpy+0x50> // b.plast
60054: f101805f cmp x2, #0x60
60058: 54000648 b.hi 60120 <memcpy+0xe0> // b.pmore
6005c: d1000449 sub x9, x2, #0x1
60060: a9401c26 ldp x6, x7, [x1]
60064: 37300469 tbnz w9, #6, 600f0 <memcpy+0xb0>
60068: a97f348c ldp x12, x13, [x4, #-16]
6006c: 362800a9 tbz w9, #5, 60080 <memcpy+0x40>
60070: a9412428 ldp x8, x9, [x1, #16]
60074: a97e2c8a ldp x10, x11, [x4, #-32]
60078: a9012408 stp x8, x9, [x0, #16]
6007c: a93e2caa stp x10, x11, [x5, #-32]
60080: a9001c06 stp x6, x7, [x0]
60084: a93f34ac stp x12, x13, [x5, #-16]
60088: d65f03c0 ret
6008c: d503201f nop
60090: f100205f cmp x2, #0x8
60094: 540000e3 b.cc 600b0 <memcpy+0x70> // b.lo, b.ul, b.last
60098: f9400026 ldr x6, [x1]
6009c: f85f8087 ldur x7, [x4, #-8]
600a0: f9000006 str x6, [x0]
600a4: f81f80a7 stur x7, [x5, #-8]
600a8: d65f03c0 ret
600ac: d503201f nop
600b0: 361000c2 tbz w2, #2, 600c8 <memcpy+0x88>
600b4: b9400026 ldr w6, [x1]
600b8: b85fc087 ldur w7, [x4, #-4]
600bc: b9000006 str w6, [x0]
600c0: b81fc0a7 stur w7, [x5, #-4]
600c4: d65f03c0 ret
600c8: b4000102 cbz x2, 600e8 <memcpy+0xa8>
600cc: d341fc49 lsr x9, x2, #1
600d0: 39400026 ldrb w6, [x1]
600d4: 385ff087 ldurb w7, [x4, #-1]
600d8: 38696828 ldrb w8, [x1, x9]
600dc: 39000006 strb w6, [x0]
600e0: 38296808 strb w8, [x0, x9]
600e4: 381ff0a7 sturb w7, [x5, #-1]
600e8: d65f03c0 ret
600ec: d503201f nop
600f0: a9412428 ldp x8, x9, [x1, #16]
600f4: a9422c2a ldp x10, x11, [x1, #32]
600f8: a943342c ldp x12, x13, [x1, #48]
600fc: a97e0881 ldp x1, x2, [x4, #-32]
60100: a97f0c84 ldp x4, x3, [x4, #-16]
60104: a9001c06 stp x6, x7, [x0]
60108: a9012408 stp x8, x9, [x0, #16]
6010c: a9022c0a stp x10, x11, [x0, #32]
60110: a903340c stp x12, x13, [x0, #48]
60114: a93e08a1 stp x1, x2, [x5, #-32]
60118: a93f0ca4 stp x4, x3, [x5, #-16]
6011c: d65f03c0 ret
60120: 92400c09 and x9, x0, #0xf
60124: 927cec03 and x3, x0, #0xfffffffffffffff0
60128: a940342c ldp x12, x13, [x1]
6012c: cb090021 sub x1, x1, x9
60130: 8b090042 add x2, x2, x9
60134: a9411c26 ldp x6, x7, [x1, #16]
60138: a900340c stp x12, x13, [x0]
6013c: a9422428 ldp x8, x9, [x1, #32]
60140: a9432c2a ldp x10, x11, [x1, #48]
60144: a9c4342c ldp x12, x13, [x1, #64]!
60148: f1024042 subs x2, x2, #0x90
6014c: 54000169 b.ls 60178 <memcpy+0x138> // b.plast
60150: a9011c66 stp x6, x7, [x3, #16]
60154: a9411c26 ldp x6, x7, [x1, #16]
60158: a9022468 stp x8, x9, [x3, #32]
6015c: a9422428 ldp x8, x9, [x1, #32]
60160: a9032c6a stp x10, x11, [x3, #48]
60164: a9432c2a ldp x10, x11, [x1, #48]
60168: a984346c stp x12, x13, [x3, #64]!
6016c: a9c4342c ldp x12, x13, [x1, #64]!
60170: f1010042 subs x2, x2, #0x40
60174: 54fffee8 b.hi 60150 <memcpy+0x110> // b.pmore
60178: a97c0881 ldp x1, x2, [x4, #-64]
6017c: a9011c66 stp x6, x7, [x3, #16]
60180: a97d1c86 ldp x6, x7, [x4, #-48]
60184: a9022468 stp x8, x9, [x3, #32]
60188: a97e2488 ldp x8, x9, [x4, #-32]
6018c: a9032c6a stp x10, x11, [x3, #48]
60190: a97f2c8a ldp x10, x11, [x4, #-16]
60194: a904346c stp x12, x13, [x3, #64]
60198: a93c08a1 stp x1, x2, [x5, #-64]
6019c: a93d1ca6 stp x6, x7, [x5, #-48]
601a0: a93e24a8 stp x8, x9, [x5, #-32]
601a4: a93f2caa stp x10, x11, [x5, #-16]
601a8: d65f03c0 ret
601ac: 00000000 udf #0
When performing a memcpy on device-type memory with size = 0x8 + 0x4n (n any natural number), an exception is thrown even though care may be taken to align the src/dst pointers. The instruction at 6009c in the objdump of memcpy above, ldur x7, [x4, #-8], would in the case of a size-0xc copy perform an LDUR from a 32-bit-aligned address ending in 0x4 into a 64-bit x register, which results in a Data Abort on device-type memory.
While I understand that care must be taken when using stdlib functions in a bare-metal application, due to the nature of our codebase it would be very difficult to ensure that every call to memcpy has a size that is a multiple of 8 bytes. Shouldn't newlib/the compiler ensure that memcpy uses 32-bit w registers for any 32-bit-aligned memcpy anyway? Especially with -mstrict-align?
What are my options for an immediate fix in the meantime? I suppose I could override the definition of memcpy, but what source should I base the replacement implementation on in that case?
Any help on this is appreciated, thanks.
Actually, I think the larger "bug" is in your expectations. You simply can't use memcpy or any other library function on device memory.

The default assumption of modern optimizing compilers and libraries is that they are operating on normal memory, whose access has no side effects and which is not being concurrently accessed by any other software or hardware (*). So unaligned access (which gcc and newlib assume by default is okay) is the least of your worries. It is totally fair game for memcpy to do its work with any combination of loads or stores whatsoever. Including:

Three 4-byte accesses
An 8-byte and a 4-byte access
Twelve one-byte accesses
Two overlapping eight-byte accesses
A 16-byte load beyond the bounds of the source buffer, if it can prove that it will not cross a page boundary
Multiple loads of the same address
Multiple stores to the same address, of which any but the last could be the wrong values
Using -mstrict-align doesn't really help. First, as you already noticed, it only affects the code which you actually compile with it; it does nothing about library code that was already built. You would have to rebuild all of newlib with this option, and then audit all the assembly code in newlib separately. But it doesn't help with any of the other issues above, all of which are potentially disastrous for device memory. (And as amonakov noted, since -mstrict-align is rarely used, it can be prone to compiler bugs.)

With device memory, you need exact control over how many loads and stores are done, to which addresses, of which sizes, and in which order. There is only one mechanism in C/C++ to get that, namely
volatile. So all accesses to device memory need to be done explicitly through volatile pointers, or using assembly.

If you need 32-bit accesses done, I think the only safe way to write your example code is:
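A minimal sketch of such a volatile 32-bit copy (the function name is illustrative, and it assumes both pointers are 4-byte aligned and the size is a multiple of 4):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes between 4-byte-aligned buffers using only 32-bit
 * accesses. The volatile qualifiers force the compiler to emit exactly
 * one 32-bit load and one 32-bit store per word, in order, with no
 * merging, splitting, or widening. */
static void copy_device32(volatile uint32_t *dst,
                          const volatile uint32_t *src,
                          size_t len)
{
    for (size_t i = 0; i < len / 4; i++)
        dst[i] = src[i];
}
```

For the 0xc-byte copy in the question, this would be called as copy_device32((volatile uint32_t *)0x500, (const volatile uint32_t *)0x1000, 0xc), which on AArch64 must be lowered to three 32-bit ldr/str w-register pairs.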
And if you do this for all device memory, then you can safely use compiled code and library functions on your normal memory, without needing -mstrict-align either. (Provided that you properly marked all normal memory as such in the page tables, and that the SCTLR_ELx.A bit is cleared.)

(*) The C/C++ data race rules do allow multiple readers to concurrently access the same memory. So you can assume that memory which you do not explicitly write will not be written at all. Beyond that, the compiler has nearly complete liberty to invent / discard / combine / reorder loads and stores in any fashion.
-mstrict-aligneither. (Provided that you properly marked all normal memory as such in the page tables, and that theSCTLR_ELx.Abit is cleared.)(*) The C/C++ data race rules do allow multiple readers to concurrently access the same memory. So you can assume that memory which you do not explicitly write, will not be written at all. Beyond that, the compiler has nearly complete liberty to invent / discard / combine / reorder loads and stores in any fashion.