memcpy on AARCH64 yielding unaligned Data Abort Exception, ARM GNU Toolchain or newlibc Bug?


I've been using the ARM GCC release aarch64-none-elf-gcc-11.2.1 for some time in a large bare-metal project that has successfully used libc functions (malloc/memcpy) many times without issue, linking with these options:

-L$AARCH64_GCC_PATH/aarch64-none-elf/lib -lc -lnosys -lg

I recently saw an exception due to an unaligned access during memcpy despite compiling with -mstrict-align.

After isolating the issue and creating a unit test, I believe I've found a bug. Please ignore the addresses in the objdump and the memcpy call; I made them up for this test.

//unit test
#include <stdlib.h>
#include <string.h>
volatile int bssTest;

void swap(int a, int b) {
    memcpy((void*)0x500,(void*)0x1000,0xc);
}
0000000000060040 <memcpy>:
   60040:   f9800020    prfm    pldl1keep, [x1]
   60044:   8b020024    add x4, x1, x2
   60048:   8b020005    add x5, x0, x2
   6004c:   f100405f    cmp x2, #0x10
   60050:   54000209    b.ls    60090 <memcpy+0x50>  // b.plast
   60054:   f101805f    cmp x2, #0x60
   60058:   54000648    b.hi    60120 <memcpy+0xe0>  // b.pmore
   6005c:   d1000449    sub x9, x2, #0x1
   60060:   a9401c26    ldp x6, x7, [x1]
   60064:   37300469    tbnz    w9, #6, 600f0 <memcpy+0xb0>
   60068:   a97f348c    ldp x12, x13, [x4, #-16]
   6006c:   362800a9    tbz w9, #5, 60080 <memcpy+0x40>
   60070:   a9412428    ldp x8, x9, [x1, #16]
   60074:   a97e2c8a    ldp x10, x11, [x4, #-32]
   60078:   a9012408    stp x8, x9, [x0, #16]
   6007c:   a93e2caa    stp x10, x11, [x5, #-32]
   60080:   a9001c06    stp x6, x7, [x0]
   60084:   a93f34ac    stp x12, x13, [x5, #-16]
   60088:   d65f03c0    ret
   6008c:   d503201f    nop
   60090:   f100205f    cmp x2, #0x8
   60094:   540000e3    b.cc    600b0 <memcpy+0x70>  // b.lo, b.ul, b.last
   60098:   f9400026    ldr x6, [x1]
   6009c:   f85f8087    ldur    x7, [x4, #-8]
   600a0:   f9000006    str x6, [x0]
   600a4:   f81f80a7    stur    x7, [x5, #-8]
   600a8:   d65f03c0    ret
   600ac:   d503201f    nop
   600b0:   361000c2    tbz w2, #2, 600c8 <memcpy+0x88>
   600b4:   b9400026    ldr w6, [x1]
   600b8:   b85fc087    ldur    w7, [x4, #-4]
   600bc:   b9000006    str w6, [x0]
   600c0:   b81fc0a7    stur    w7, [x5, #-4]
   600c4:   d65f03c0    ret
   600c8:   b4000102    cbz x2, 600e8 <memcpy+0xa8>
   600cc:   d341fc49    lsr x9, x2, #1
   600d0:   39400026    ldrb    w6, [x1]
   600d4:   385ff087    ldurb   w7, [x4, #-1]
   600d8:   38696828    ldrb    w8, [x1, x9]
   600dc:   39000006    strb    w6, [x0]
   600e0:   38296808    strb    w8, [x0, x9]
   600e4:   381ff0a7    sturb   w7, [x5, #-1]
   600e8:   d65f03c0    ret
   600ec:   d503201f    nop
   600f0:   a9412428    ldp x8, x9, [x1, #16]
   600f4:   a9422c2a    ldp x10, x11, [x1, #32]
   600f8:   a943342c    ldp x12, x13, [x1, #48]
   600fc:   a97e0881    ldp x1, x2, [x4, #-32]
   60100:   a97f0c84    ldp x4, x3, [x4, #-16]
   60104:   a9001c06    stp x6, x7, [x0]
   60108:   a9012408    stp x8, x9, [x0, #16]
   6010c:   a9022c0a    stp x10, x11, [x0, #32]
   60110:   a903340c    stp x12, x13, [x0, #48]
   60114:   a93e08a1    stp x1, x2, [x5, #-32]
   60118:   a93f0ca4    stp x4, x3, [x5, #-16]
   6011c:   d65f03c0    ret
   60120:   92400c09    and x9, x0, #0xf
   60124:   927cec03    and x3, x0, #0xfffffffffffffff0
   60128:   a940342c    ldp x12, x13, [x1]
   6012c:   cb090021    sub x1, x1, x9
   60130:   8b090042    add x2, x2, x9
   60134:   a9411c26    ldp x6, x7, [x1, #16]
   60138:   a900340c    stp x12, x13, [x0]
   6013c:   a9422428    ldp x8, x9, [x1, #32]
   60140:   a9432c2a    ldp x10, x11, [x1, #48]
   60144:   a9c4342c    ldp x12, x13, [x1, #64]!
   60148:   f1024042    subs    x2, x2, #0x90
   6014c:   54000169    b.ls    60178 <memcpy+0x138>  // b.plast
   60150:   a9011c66    stp x6, x7, [x3, #16]
   60154:   a9411c26    ldp x6, x7, [x1, #16]
   60158:   a9022468    stp x8, x9, [x3, #32]
   6015c:   a9422428    ldp x8, x9, [x1, #32]
   60160:   a9032c6a    stp x10, x11, [x3, #48]
   60164:   a9432c2a    ldp x10, x11, [x1, #48]
   60168:   a984346c    stp x12, x13, [x3, #64]!
   6016c:   a9c4342c    ldp x12, x13, [x1, #64]!
   60170:   f1010042    subs    x2, x2, #0x40
   60174:   54fffee8    b.hi    60150 <memcpy+0x110>  // b.pmore
   60178:   a97c0881    ldp x1, x2, [x4, #-64]
   6017c:   a9011c66    stp x6, x7, [x3, #16]
   60180:   a97d1c86    ldp x6, x7, [x4, #-48]
   60184:   a9022468    stp x8, x9, [x3, #32]
   60188:   a97e2488    ldp x8, x9, [x4, #-32]
   6018c:   a9032c6a    stp x10, x11, [x3, #48]
   60190:   a97f2c8a    ldp x10, x11, [x4, #-16]
   60194:   a904346c    stp x12, x13, [x3, #64]
   60198:   a93c08a1    stp x1, x2, [x5, #-64]
   6019c:   a93d1ca6    stp x6, x7, [x5, #-48]
   601a0:   a93e24a8    stp x8, x9, [x5, #-32]
   601a4:   a93f2caa    stp x10, x11, [x5, #-16]
   601a8:   d65f03c0    ret
   601ac:   00000000    udf #0

When performing a memcpy on device-type memory where size = 0x8 + 0x4n (n any natural number), an exception is thrown even when care has been taken to align the src/dst pointers. The instruction at 6009c in the objdump of memcpy above, ldur x7, [x4, #-8], ends up doing a 64-bit load into an x register from an address that is only 32-bit aligned (ending in 0x4) in the case of a size-0xc copy, which raises a Data Abort on device-type memory.
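In C terms, the small-size path shown in the disassembly (8 < n <= 16) performs two possibly-overlapping 8-byte accesses anchored at the start and end of the buffer. The following is only a sketch of that access pattern, not the actual newlib source; each memcpy of 8 bytes here stands in for a single 8-byte load/store instruction:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Sketch of the 8 < n <= 16 path: two possibly-overlapping 8-byte
// accesses. For n = 0xc and a 4-byte-aligned source, the second load
// starts at src + 4, which is not 8-byte aligned -- harmless on normal
// memory, fatal on device memory.
static void copy_8_to_16(unsigned char *dst, const unsigned char *src,
                         size_t n) {
    uint64_t head, tail;
    memcpy(&head, src, 8);             // ldr  x6, [x1]
    memcpy(&tail, src + n - 8, 8);     // ldur x7, [x4, #-8]  <- unaligned
    memcpy(dst, &head, 8);             // str  x6, [x0]
    memcpy(dst + n - 8, &tail, 8);     // stur x7, [x5, #-8]
}
```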

While I understand that care must be taken when using stdlib functions in a bare-metal application, due to the nature of our codebase it would be very difficult to ensure that every call to memcpy has a size that is a multiple of 8 bytes. Shouldn't newlib/the compiler ensure that memcpy uses 32-bit w registers for any 32-bit-aligned memcpy anyway? Especially with -mstrict-align?

What are my options for an immediate fix in the meantime? I suppose I could try to override the definition of memcpy, but what source should I base the replacement implementation on in that case?

Any help on this is appreciated, thanks.


1 Answer

Answered by Nate Eldredge

Actually, I think the larger "bug" is in your expectations. You simply can't use memcpy or any other library function on device memory.

The default assumption of modern optimizing compilers and libraries is that they are operating on normal memory, whose access has no side effects and which is not being concurrently accessed by any other software or hardware (*). So unaligned access (which gcc and newlib assume by default is okay) is the least of your worries. It is totally fair game for memcpy to do its work with any combination of loads or stores whatsoever. Including:

  • Three 4-byte accesses

  • An 8-byte and a 4-byte access

  • Twelve one-byte accesses

  • Two overlapping eight-byte accesses

  • A 16-byte load beyond the bounds of the source buffer, if it can prove that it will not cross a page boundary

  • Multiple loads of the same address

  • Multiple stores to the same address, of which any but the last could be the wrong values
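This latitude is not specific to library code; the compiler takes the same liberties with accesses you write yourself. A hypothetical example (not from the question's codebase) of why volatile matters:

```c
#include <stdint.h>

// With ordinary pointers, the compiler may merge these two adjacent
// 32-bit stores into a single 64-bit store (e.g. one stp/str on
// AArch64) -- fine for normal memory, wrong for a pair of 32-bit
// device registers.
void write_pair(uint32_t *p) {
    p[0] = 1;
    p[1] = 2;
}

// With volatile, each 32-bit store must be emitted separately and in
// program order, preserving the exact access width.
void write_pair_dev(volatile uint32_t *p) {
    p[0] = 1;
    p[1] = 2;
}
```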

Using -mstrict-align doesn't really help. First, as you already noticed, it only affects the code which you actually compile with it; it does nothing about library code that was already built. You would have to rebuild all of newlib with this option, and then audit all the assembly code in newlib separately. But it doesn't help with any of the other issues above, all of which are potentially disastrous for device memory. (And as amonakov noted, since -mstrict-align is rarely used, it can be prone to compiler bugs.)

With device memory, you need exact control over how many loads and stores are done, to which addresses, of which sizes, and in which order. There is only one mechanism in C/C++ to get that, namely volatile. So all accesses to device memory need to be done explicitly through volatile pointers, or using assembly.

If you need 32-bit accesses done, I think the only safe way to write your example code is:

volatile uint32_t *dest = (volatile uint32_t *)0x500;
volatile uint32_t *src = (volatile uint32_t *)0x1000;
for (int i = 0; i < 3; i++)
    dest[i] = src[i];

And if you do this for all device memory, then you can safely use compiled code and library functions on your normal memory, without needing -mstrict-align either. (Provided that you properly marked all normal memory as such in the page tables, and that the SCTLR_ELx.A bit is cleared.)
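As for the question's interim fix of overriding memcpy: for ordinary memory, a minimal byte-wise replacement is easy to sketch. This is hypothetical code, not the newlib source; rename my_memcpy to memcpy and link it ahead of libc to override newlib's version. Note that byte accesses still won't satisfy devices that require full 32-bit reads or writes, which is why the explicit volatile loop above remains the right tool for device memory:

```c
#include <stddef.h>

// Hypothetical drop-in replacement: copies one byte at a time, so every
// access is naturally aligned and no wide or unaligned loads are ever
// generated. Slow, but alignment-safe.
void *my_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```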


(*) The C/C++ data race rules do allow multiple readers to concurrently access the same memory. So you can assume that memory which you do not explicitly write, will not be written at all. Beyond that, the compiler has nearly complete liberty to invent / discard / combine / reorder loads and stores in any fashion.