Call const function address passed to gcc inline assembler (avr-gcc)

I'm writing an RPC library for AVR and need to pass a function address to some inline assembler code and call the function from within that code. However, the assembler complains when I try to call the function directly.

This minimal example test.cpp illustrates the issue (in the actual case I'm passing args and the function is an instantiation of a static member of templated class):

void bar () {
    return;
}

void foo() {
    asm volatile (
        "call %0" "\n"
        :
        : "p" (bar)
    );
}

Compiling with avr-gcc -S test.cpp -o test.S -mmcu=atmega328p works fine but when I try to assemble with avr-gcc -c test.S -o test.o -mmcu=atmega328p avr-as complains:

test.c: Assembler messages:
test.c:38: Error: garbage at end of line

I have no idea why it writes "test.c", the file it is referring to is test.S, which contains this on line 38:

call gs(_Z3barv)

I have tried every remotely sensible constraint on the parameter to the inline assembler that I could find here, but none of them worked.

I imagine that if the gs() part were removed, everything would work, but every constraint seems to add it. I have no idea what it does.

The odd thing is that doing an indirect call like this assembles just fine:

void bar () {
    return;
}

void foo() {
    asm volatile (
        "ldi r30, lo8(%0)" "\n"
        "ldi r31, hi8(%0)" "\n"
        "icall" "\n"
        :
        : "p" (bar)
    );
}

The assembly produced looks like this:

ldi r30, lo8(gs(_Z3barv))
ldi r31, hi8(gs(_Z3barv))
icall

And avr-as doesn't complain about any garbage.

There are 3 answers

emacs drives me nuts (best answer)

There are several issues with the code:

Issue 1: Wrong Constraint

The correct constraint for a call target is "i": an immediate operand whose value is known at link time.

Issue 2: Wrong % print-modifier

To print an address suitable for a call, use %x, which prints a plain symbol without gs(). A gs() expression, which requests a linker stub, is not valid syntax as the operand of call, hence "garbage at end of line". Besides, since you are calling bar directly, there is no need for a linker stub (at least not for this kind of symbol usage).

Issue 3: call instruction might not be available

To factor out whether a device supports call or just rcall, there is %~ which prints a single r if just rcall is available, and nothing if call is available.

Issue 4: The Call might clobber Registers or have other Side-Effects

It's unlikely that the call has no effect on registers or memory whatsoever. If your description of the inline asm does not match the side effects of the code, you will likely get wrong code sooner or later.

Taking it all together

Let's assume you have a function bar written in assembly that takes two 16-bit operands in R22 and R26, and computes a result in R22. This function does not obey the avr-gcc C/C++ calling convention, so inline assembly is one way to interface to such a function. We cannot write a correct prototype for bar anyway, so we just provide a prototype that lets us use the symbol bar. Register X has the constraint "x", but R22 has no register constraint of its own, and therefore we have to use a local asm register:

extern "C" void bar (...);

int call_bar (int x, int y)
{
    register int r22 __asm ("r22") = x;
    __asm ("%~call %x2"
           : "+r" (r22)
           : "x" (y), "i" (bar));
    return r22;
}

Generated code for ATmega32 + optimization:

_Z8call_barii:
    movw r26,r22
    movw r22,r24
    call bar
    movw r24,r22
    ret
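
For a function that does follow the avr-gcc calling convention, such as the question's bar, the four fixes above could be combined roughly as follows. This is only a sketch and has not been run on hardware: the clobber list assumes the usual avr-gcc call-clobbered registers (R18 to R27, R30, R31), the counter bar_calls is a hypothetical addition that makes the call observable, and the #else branch exists only so that the snippet also builds on a non-AVR host.

```c
/* Hypothetical counter so that the effect of the call is observable. */
volatile unsigned bar_calls = 0;

void bar (void)
{
    bar_calls++;
}

void foo (void)
{
#if defined(__AVR__)
    /* "i"  : the call target is a link-time constant (Issue 1).
       %x0  : print the plain symbol, without gs() (Issue 2).
       %~   : prints "r" on devices that only have rcall (Issue 3).
       The clobber list tells the compiler what a callee that follows
       the C calling convention may overwrite (Issue 4).  */
    __asm__ volatile ("%~call %x0"
                      : /* no outputs */
                      : "i" (bar)
                      : "r18", "r19", "r20", "r21", "r22", "r23",
                        "r24", "r25", "r26", "r27", "r30", "r31",
                        "cc", "memory");
#else
    /* Host fallback (assumption, for illustration only): a plain C call. */
    bar ();
#endif
}
```

With the clobbers listed, the compiler saves whatever it still needs around the asm, which is what it would do around an ordinary call anyway.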

So what's that "generate stub" gs() thing?

Suppose the C/C++ code is taking the address of a function. The only sensible thing to do with it is to call that function, which will be an indirect call in general. Now an indirect call can target 64KiW = 128KiB at most, so on devices with > 128KiB of code memory, special measures must be taken to indirectly call a function beyond the 128KiB boundary. The AVR hardware features an SFR named EIND for that purpose, but the problems with using it are obvious: you would have to set it prior to a call and then reset it somehow, somewhere; all sorts of evil things would be necessary.

avr-gcc takes a different approach: For each such address taken, the compiler generates gs(func). This will just resolve to func if the address is in the 128KiB range. If not, gs() resolves to an address in section .trampolines, which is located close to the beginning of flash, i.e. in the lower 128KiB. .trampolines contains a list of direct JMPs to targets beyond the 128KiB range.
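
The rule just described can be written down as a small host-side model. This is a sketch of the description above, not of the actual linker code: the function name gs_model, its parameters, and the addresses in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* An indirect call takes a 16-bit word address in Z, so it can reach
   64 KiW = 128 KiB of flash. */
#define AVR_INDIRECT_LIMIT_BYTES 0x20000UL

/* Hypothetical model: the 16-bit word address that gs(func) yields.
   func_byte_addr: byte address of the function itself.
   stub_byte_addr: byte address of its JMP stub in .trampolines
                   (only used when the function is out of reach). */
uint16_t gs_model (uint32_t func_byte_addr, uint32_t stub_byte_addr)
{
    /* The stub has to live in the lower 128 KiB to be reachable. */
    assert (stub_byte_addr < AVR_INDIRECT_LIMIT_BYTES);

    uint32_t target = (func_byte_addr < AVR_INDIRECT_LIMIT_BYTES)
                      ? func_byte_addr   /* in range: func itself */
                      : stub_byte_addr;  /* too far: go via the stub */

    return (uint16_t) (target / 2);      /* byte address -> word address */
}
```

For instance, gs_model (0x1000, 0x100) gives word address 0x0800 (the function itself), whereas gs_model (0x30000, 0x100) gives 0x0080 (the stub).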

Take for example the following C code:

extern int far_func (void);

int main (void)
{
    int (*pfunc)(void) = far_func;
    __asm ("" : "+r" (pfunc)); /* Forget content of pfunc. */
    return pfunc();
}

The __asm is used to keep the compiler from optimizing the indirect call to a direct one. Then run

> avr-gcc main.c -o main.elf -mmcu=atmega2560 -save-temps -Os -Wl,--defsym,far_func=0x24680
> avr-objdump -d main.elf > main.lst

For brevity, we just define the symbol far_func on the command line. The assembly dump in main.s shows that far_func might require a linker stub:

main:
    ldi r30,lo8(gs(far_func))
    ldi r31,hi8(gs(far_func))
    eijmp

The final executable listing in main.lst then shows that the stub is actually generated and used:

main.elf:     file format elf32-avr

Disassembly of section .text:
...

000000e4 <__trampolines_start>:
  e4:   0d 94 40 23     jmp 0x24680 ; 0x24680 <far_func>

...

00000104 <main>:
 104:   e2 e7           ldi r30, 0x72   ; 114
 106:   f0 e0           ldi r31, 0x00   ; 0
 108:   19 94           eijmp

main loads Z=0x0072 which is a word address for byte address 0x00e4, i.e. the code is indirectly jumping to 0x00e4, and from there it jumps directly to 0x24680.
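
The numbers in this listing can be double-checked with a few lines of host-side C. The helper names below are made up for this sketch. The bit layouts used are those of the AVR LDI instruction (1110 KKKK dddd KKKK) and the two-word JMP instruction (1001 010k kkkk 110k, followed by 16 more bits of k). The listing prints opcode bytes in little-endian order, so "e2 e7" is the word 0xe7e2.

```c
#include <assert.h>
#include <stdint.h>

/* Flash is addressed in 16-bit words: byte address = 2 * word address. */
uint32_t word_to_byte_addr (uint32_t word_addr)
{
    return word_addr * 2;
}

/* Extract the 8-bit constant K from "ldi Rd, K". */
uint8_t ldi_immediate (uint16_t opcode)
{
    assert ((opcode & 0xF000) == 0xE000);   /* must be an LDI */
    return (uint8_t) (((opcode >> 4) & 0xF0) | (opcode & 0x0F));
}

/* Byte address targeted by a two-word JMP instruction. */
uint32_t jmp_target_byte_addr (uint16_t w0, uint16_t w1)
{
    assert ((w0 & 0xFE0E) == 0x940C);       /* must be a JMP */
    uint32_t k = ((uint32_t) ((w0 >> 4) & 0x1F) << 17)  /* k21..k17 */
               | ((uint32_t) (w0 & 0x1) << 16)          /* k16 */
               | w1;                                    /* k15..k0 */
    return word_to_byte_addr (k);
}
```

With the opcodes from main.lst, ldi_immediate (0xe7e2) is 0x72 and ldi_immediate (0xe0f0) is 0x00, so Z = 0x0072 and word_to_byte_addr (0x0072) is 0xe4, the start of the trampolines. jmp_target_byte_addr (0x940d, 0x2340) is 0x24680, the address given to far_func on the command line.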

JimmyB

Note that call requires a constant, known-at-link-time value. The "p" constraint does not carry those semantics; it would also allow a pointer from a variable (e.g. char* x), which call cannot handle. (I seem to remember that gcc is sometimes clever enough to optimize in such a way that "p" does work here, but that's basically undocumented behavior and non-deterministic, so better not count on it.)

If the function you're calling is actually a compile-time constant, you can use "i" (bar). If it's not, then you have no choice but to use icall, as you already figured out.

Btw, the AVR section of https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints documents some more, AVR-specific constraints.

nqtronix

I've tried various ways of passing a C function name to inline asm code without success. However, I did find a workaround which seems to provide the desired result.

Answer to the question:

As explained on https://www.nongnu.org/avr-libc/user-manual/inline_asm.html you can assign an asm name to a C function in a prototype declaration:

void bar (void) asm ("ASM_BAR");    // any name possible here
void bar (void)
{
    return;
}

Then you can call the function easily from your ASM code:

asm volatile("call ASM_BAR");

Use with library functions:

This approach does not work with library functions, because they have their own prototype declarations. To call a function like system_tick() from the time.h library more efficiently from an ISR, you can declare a helper function. Unfortunately, GCC does not apply the inline setting to calls from asm code.

inline void asm_system_tick(void) asm ("ASM_SYSTEM_TICK") __attribute__((always_inline));
void asm_system_tick(void)
{
    system_tick();
}

In the following example, GCC generates push/pop instructions only for the surrounding code, not for the function call! Note that system_tick() is specifically designed for ISR_NAKED and does all required stack operations on its own.

volatile uint8_t tick = 0;
ISR(TIMER2_OVF_vect)
{
    tick++;
    if (tick > 127)
    {
        tick = 0;
        asm volatile ("call ASM_SYSTEM_TICK");
    }
}

Because the inline attribute does not work, each function call takes 8 additional CPU cycles. Compared to the 5632 CPU cycles required for push/pop operations with a normal function call (44 CPU cycles for each run of the ISR), it is still a very impressive improvement.
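
The 5632 figure follows from the ISR above: tick counts 128 ISR runs between two calls, and each run costs 44 cycles of push/pop. A small sketch of that arithmetic, with made-up names and with the cycle counts taken from this answer rather than measured:

```c
/* Figures taken from the text above; nothing here is measured. */
enum {
    ISR_RUNS_PER_CALL     = 128, /* ISR runs between two system_tick() calls */
    PUSH_POP_CYCLES       = 44,  /* per ISR run, with a normal function call */
    ASM_CALL_EXTRA_CYCLES = 8    /* per call, with the asm workaround */
};

/* Push/pop cycles spent per system_tick() call with a normal call. */
unsigned long normal_call_overhead (void)
{
    return (unsigned long) ISR_RUNS_PER_CALL * PUSH_POP_CYCLES;
}

/* Extra cycles spent per system_tick() call with the asm workaround. */
unsigned long asm_call_overhead (void)
{
    return ASM_CALL_EXTRA_CYCLES;
}
```

normal_call_overhead () evaluates to 128 * 44 = 5632, against 8 for asm_call_overhead ().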