Optimizing static library with LTO at the public API boundary


Intro

I have a set of fairly large and complex static libraries for an embedded target that I want to optimize with LTO at the boundary of the public API. I have a project with the following layout:

[image: directory layout]

The public.h header file declares the public API symbols, which have default visibility.

#pragma once

#pragma GCC visibility push(default)

void public_function_1(int a);
void public_function_2(void);
int public_function_3(int a, int b, int* c);

#pragma GCC visibility pop

The internal.h header declares internal symbols that are never used outside the library.

#pragma once

int internal_function_1(int a);
void internal_function_2(int a, int b);
int internal_function_3(int a, int b);
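
The same intent could also be written into the headers instead of relying only on a command-line visibility flag. The following is a minimal sketch, assuming GCC or a compatible compiler. The LIB_PUBLIC and LIB_INTERNAL macro names are hypothetical and not part of the project, hidden is used where the build below uses internal, and the function bodies are simplified placeholders (the body of internal_function_3 is inferred from the disassembly further down).

```c
/* Hypothetical helper macros; these names are not part of the original project. */
#if defined(__GNUC__)
#define LIB_PUBLIC   __attribute__((visibility("default")))
#define LIB_INTERNAL __attribute__((visibility("hidden")))
#else
#define LIB_PUBLIC
#define LIB_INTERNAL
#endif

/* internal.h style: visibility is stated on the declaration itself. */
LIB_INTERNAL int internal_function_3(int a, int b);

/* public.h style: only this symbol is meant to be part of the API. */
LIB_PUBLIC int public_function_3(int a, int b, int *c);

/* Placeholder definitions so that the sketch compiles on its own. */
int internal_function_3(int a, int b)
{
    return a + b; /* matches "add r0, r1" in the disassembly */
}

int public_function_3(int a, int b, int *c)
{
    /* Simplified: the real function also touches library state. */
    *c = internal_function_3(a, b);
    return *c;
}
```

Visibility and binding are separate ELF fields: a hidden symbol in a relocatable object still has GLOBAL binding, which is why nm prints an uppercase T for it.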

The content of public.c is not that important, but it's included here for a fuller picture:

#include "public.h"
#include "internal.h"

static int some_state = 1;

void public_function_1(int a)
{
        some_state *= a;
        some_state++;
}

void public_function_2(void)
{
        internal_function_2(10, 20);
        some_state = internal_function_1(some_state);
}

int public_function_3(int a, int b, int* c)
{
        internal_function_2(20, 50);
        *c = internal_function_3(a, b);
        return *c + some_state;
}

Content of internal.c is irrelevant.

Building the library

I know that static libraries are merely an archive of object files, so here is my plan to optimize them with LTO:

  1. Compile public.c and internal.c with the following commands:
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -fvisibility=internal -fdata-sections -ffunction-sections -c internal.c -o internal.c.obj <arch flags>
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -fvisibility=internal -fdata-sections -ffunction-sections -c public.c -o public.c.obj <arch flags>
  2. Partially link the object files with -r -flinker-output=nolto-rel:
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -r -flinker-output=nolto-rel internal.c.obj public.c.obj -o lib
  3. Put the resulting object file into a library with ar.

The -flinker-output=nolto-rel option instructs GCC to emit machine code instead of GIMPLE IR.

I'm using the arm-zephyr-eabi toolchain, but it should work the same for plain system gcc as well. The GCC version is 12.1.0.


I ended up with an object file that contains more or less what I need, with one caveat. Here is the disassembly:

lib:     file format elf32-littlearm


Disassembly of section .text.internal_function_1:

00000000 <internal_function_1>:
   0:   220a        movs    r2, #10
   2:   4b02        ldr r3, [pc, #8]    ; (c <internal_function_1+0xc>)
   4:   6818        ldr r0, [r3, #0]
   6:   4350        muls    r0, r2
   8:   6018        str r0, [r3, #0]
   a:   4770        bx  lr
   c:   00000000    .word   0x00000000

Disassembly of section .text.internal_function_2:

00000000 <internal_function_2>:
   0:   4b01        ldr r3, [pc, #4]    ; (8 <internal_function_2+0x8>)
   2:   4408        add r0, r1
   4:   6018        str r0, [r3, #0]
   6:   4770        bx  lr
   8:   00000000    .word   0x00000000

Disassembly of section .text.internal_function_3:

00000000 <internal_function_3>:
   0:   4408        add r0, r1
   2:   4770        bx  lr

Disassembly of section .text.public_function_1:

00000000 <public_function_1>:
   0:   4b02        ldr r3, [pc, #8]    ; (c <public_function_1+0xc>)
   2:   681a        ldr r2, [r3, #0]
   4:   4350        muls    r0, r2
   6:   3001        adds    r0, #1
   8:   6018        str r0, [r3, #0]
   a:   4770        bx  lr
   c:   00000000    .word   0x00000000

Disassembly of section .text.public_function_2:

00000000 <public_function_2>:
   0:   f44f 7396   mov.w   r3, #300    ; 0x12c
   4:   4a02        ldr r2, [pc, #8]    ; (10 <public_function_2+0x10>)
   6:   6013        str r3, [r2, #0]
   8:   4a02        ldr r2, [pc, #8]    ; (14 <public_function_2+0x14>)
   a:   6013        str r3, [r2, #0]
   c:   4770        bx  lr
   e:   bf00        nop
    ...

Disassembly of section .text.public_function_3:

00000000 <public_function_3>:
   0:   b510        push    {r4, lr}
   2:   2446        movs    r4, #70 ; 0x46
   4:   4b03        ldr r3, [pc, #12]   ; (14 <public_function_3+0x14>)
   6:   4408        add r0, r1
   8:   601c        str r4, [r3, #0]
   a:   4b03        ldr r3, [pc, #12]   ; (18 <public_function_3+0x18>)
   c:   6010        str r0, [r2, #0]
   e:   681b        ldr r3, [r3, #0]
  10:   4418        add r0, r3
  12:   bd10        pop {r4, pc}
    ...
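
To make the listing easier to follow, here is a rough C reconstruction of what the code appears to compute. This is a sketch inferred from the disassembly and from public.c, not the real internal.c: the name internal_state is hypothetical (nm suggests the real variable is also called some_state), and its initial value of 1 is an assumption.

```c
/* State of internal.c; name and initial value are assumptions. */
static int internal_state = 1;

/* State of public.c, as shown in the question. */
static int some_state = 1;

/* The disassembly overwrites r0 before using it, so the argument is ignored. */
int internal_function_1(int a)
{
    (void)a;
    internal_state *= 10;
    return internal_state;
}

/* Stores the sum of its arguments into the internal state. */
void internal_function_2(int a, int b)
{
    internal_state = a + b;
}

/* A plain addition: "add r0, r1; bx lr". */
int internal_function_3(int a, int b)
{
    return a + b;
}

void public_function_2(void)
{
    internal_function_2(10, 20);                     /* internal_state = 30 */
    some_state = internal_function_1(some_state);    /* 30 * 10 = 300 */
}

int public_function_3(int a, int b, int *c)
{
    internal_function_2(20, 50);                     /* internal_state = 70 */
    *c = internal_function_3(a, b);
    return *c + some_state;
}
```

Under these assumptions, the constant 300 stored twice in public_function_2 and the constant 70 stored in public_function_3 are the folded results of the inlined internal calls.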

The problem

As the disassembly shows, the internal_function_xyz functions have been inlined into the bodies of the public functions, so LTO itself works. What I'm not happy about is that the internal_function_xyz symbols, along with their machine code, are still present in the object file. I expected the linker to discard them, since they were compiled with internal visibility (-fvisibility=internal). The output of nm shows the following:

00000000 T internal_function_1
00000000 T internal_function_2
00000000 T internal_function_3
00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3
00000000 d some_state.lto_priv.0
00000000 d some_state.lto_priv.1

So although the symbols have internal visibility, they are still listed as global symbols in the symbol table (nm prints an uppercase T). My suspicion is that this is what caused the linker to keep them. I tried to get rid of those symbols using objcopy and strip, like so:

arm-zephyr-eabi-objcopy --localize-hidden lib localized_symbols

The symbol table now looks like this:

00000000 t internal_function_1
00000000 t internal_function_2
00000000 t internal_function_3
00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3
00000000 d some_state.lto_priv.0
00000000 d some_state.lto_priv.1

After the following command:

arm-zephyr-eabi-strip --strip-unneeded localized_symbols

I end up with:

00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3

However, in the disassembly the machine code still remains:

localized_symbols:     file format elf32-littlearm


Disassembly of section .text.internal_function_1:

00000000 <.text.internal_function_1>:
   0:   220a        movs    r2, #10
   2:   4b02        ldr r3, [pc, #8]    ; (c <.text.internal_function_1+0xc>)
   4:   6818        ldr r0, [r3, #0]
   6:   4350        muls    r0, r2
   8:   6018        str r0, [r3, #0]
   a:   4770        bx  lr
   c:   00000000    .word   0x00000000

Disassembly of section .text.internal_function_2:

00000000 <.text.internal_function_2>:
   0:   4b01        ldr r3, [pc, #4]    ; (8 <.text.internal_function_2+0x8>)
   2:   4408        add r0, r1
   4:   6018        str r0, [r3, #0]
   6:   4770        bx  lr
   8:   00000000    .word   0x00000000

Disassembly of section .text.internal_function_3:

00000000 <.text.internal_function_3>:
   0:   4408        add r0, r1
   2:   4770        bx  lr

Disassembly of section .text.public_function_1:

00000000 <public_function_1>:
   0:   4b02        ldr r3, [pc, #8]    ; (c <public_function_1+0xc>)
   2:   681a        ldr r2, [r3, #0]
   4:   4350        muls    r0, r2
   6:   3001        adds    r0, #1
   8:   6018        str r0, [r3, #0]
   a:   4770        bx  lr
   c:   00000000    .word   0x00000000

Disassembly of section .text.public_function_2:

00000000 <public_function_2>:
   0:   f44f 7396   mov.w   r3, #300    ; 0x12c
   4:   4a02        ldr r2, [pc, #8]    ; (10 <public_function_2+0x10>)
   6:   6013        str r3, [r2, #0]
   8:   4a02        ldr r2, [pc, #8]    ; (14 <public_function_2+0x14>)
   a:   6013        str r3, [r2, #0]
   c:   4770        bx  lr
   e:   bf00        nop
    ...

Disassembly of section .text.public_function_3:

00000000 <public_function_3>:
   0:   b510        push    {r4, lr}
   2:   2446        movs    r4, #70 ; 0x46
   4:   4b03        ldr r3, [pc, #12]   ; (14 <public_function_3+0x14>)
   6:   4408        add r0, r1
   8:   601c        str r4, [r3, #0]
   a:   4b03        ldr r3, [pc, #12]   ; (18 <public_function_3+0x18>)
   c:   6010        str r0, [r2, #0]
   e:   681b        ldr r3, [r3, #0]
  10:   4418        add r0, r3
  12:   bd10        pop {r4, pc}
    ...

Question

Is there any way I can optimize the library with LTO, keep only public API symbols in the symbol table and not have any redundant internal symbols and machine code?

Additional concerns

Since GCC ends up generating machine code for the internal functions and putting them into the symbol table, I suspect that the optimization might not be as aggressive as it could be. Suppose internal_function_1 were much larger and were referenced only once within the library. If the optimizer has to assume more than one reference to the symbol (one from library code, plus an unknown number from externally linked code, because the symbol is present in the symbol table), it may be more reluctant to inline the function and may not optimize it as aggressively. I haven't confirmed this hypothesis yet, but if it's true, I think the only reasonable solution is to prevent those internal symbols from ever being generated and inserted into the symbol table at the relocatable link stage.
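
The effect behind this hypothesis can be illustrated in plain C, independent of LTO. The following is a minimal sketch with hypothetical function names: one helper has internal linkage, the other has external linkage, and both are called exactly once.

```c
/* Internal linkage: no other translation unit can reference this helper,
 * so the compiler is free to inline it and emit no standalone copy. */
static int helper_static(int x)
{
    return x * 10;
}

/* External linkage: other objects may call this helper, so an out-of-line
 * copy has to be kept even if every local call is inlined. */
int helper_extern(int x)
{
    return x * 10;
}

int caller_static(int x)
{
    return helper_static(x) + 1;
}

int caller_extern(int x)
{
    return helper_extern(x) + 1;
}
```

Compiling this with optimization (for example -Os -c) and running nm on the object will typically list helper_extern and both callers but no helper_static. Whether the relocatable LTO link can be made to treat the internal symbols like helper_static is exactly the open question here.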
