I am compiling some embedded software with speed and space constraints, using a PIC32 port of GCC. Because of the space constraints, I need to optimize for size (-Os); otherwise the image won't fit (e.g. with -O2). Link-time optimization (-flto) reduces code size further compared with -Os alone, but it also reduces speed significantly (by 20-25%).
How can I tune the compiler optimizations to find other space-speed trade-offs that suit my target? GCC has many options and parameters that may affect speed and size on different targets, and the documentation is minimal. A first attempt, enabling the options that -O2 uses but -Os disables (-falign-functions -falign-jumps -falign-labels -falign-loops -fprefetch-loop-arrays -freorder-blocks-algorithm=stc), gave no noticeable improvement.
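Instead of enabling all of these flags at once, each subset could be built and measured separately, so that the effect of individual flags on size and speed becomes visible. The following is a minimal Python sketch that only generates the candidate command lines (6 flags give 64 combinations). The compiler name `gcc` and the source file `main.c` are hypothetical placeholders for the actual PIC32 toolchain and sources; building, measuring image size and timing are not shown.

```python
from itertools import combinations

# Flags that -Os disables relative to -O2, as listed in the question.
O2_ONLY_FLAGS = [
    "-falign-functions",
    "-falign-jumps",
    "-falign-labels",
    "-falign-loops",
    "-fprefetch-loop-arrays",
    "-freorder-blocks-algorithm=stc",
]


def candidate_flag_sets(base="-Os", flags=O2_ONLY_FLAGS):
    """Yield every subset of `flags` on top of `base`, smallest subsets first."""
    for k in range(len(flags) + 1):
        for subset in combinations(flags, k):
            yield [base, *subset]


if __name__ == "__main__":
    # Hypothetical compiler and file names: substitute the real PIC32 GCC port
    # and sources. Each printed line is one build to measure for size and speed.
    cc = "gcc"
    for flags in candidate_flag_sets():
        print(" ".join([cc, *flags, "-c", "main.c"]))
```

The same generator could be run with `base="-Os -flto"` split into separate arguments to compare each subset with and without link-time optimization.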
The target processor is a MIPS32 without cache. The application is bare-metal and I cannot modify it to make it faster or smaller.