Optimising C code for small size - sharing static variables?

206 views Asked by At

I have two functions, both are similar to this:

void Bit_Delay()
{
    //this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
    volatile char z = 12;
    
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }
}

(The second function is analogous instead it uses 18 instead of 12 for the counter).

The code works flawlessly as it is (with z appearing locally to each function internally), but I'm trying to cram a little more functionality into my executable before I hit the (horribly) limited FLASH memory available.

My thought was to promote the z variable to be a global one (a volatile static). Because these two functions are effectively atomic operations (it's a single-threaded CPU and there are no interrupts at play to interfere), I figured that these two functions could share the single variable, thus saving a tiny bit of stack manipulation.

This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.

I realised that I needed the z variable to be marked volatile to prevent the compiler from removing code it knows is counting a fixed (and thus reducible to a constant) number each time.

Question:

Can I optimise this any further, and trick the compiler into keeping both functions intact? I'm compiling with "-Os" (optimise for small binary).

Here's the entire program verbatim for those playing along at home...

#include <avr/io.h>

#define RX_PIN (1 << PORTB0) //physical pin 3
#define TX_PIN (1 << PORTB1) //physical pin 1

void Bit_Delay()
{
    //this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
    volatile char z = 12;
    
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }
}

void Serial_TX_Char(char c)
{
    char i;
    
    //start bit
    PORTB &= ~TX_PIN;
    Bit_Delay();
    
    for(i = 0 ; i < 8 ; i++)
    {
        //output the data bits, LSB first
        if(c & 0x01)
            PORTB |= TX_PIN;
        else
            PORTB &= ~TX_PIN;
        
        c >>= 1;
        Bit_Delay();
    }

    //stop bit  
    PORTB |= TX_PIN;
    Bit_Delay();
}

char Serial_RX_Char()
{
    char retval = 0;
    volatile char z = 18; //1.5 bits delay

    //wait for idle high
    while((PINB & RX_PIN) == 0)
    {}
    
    //wait for start bit falling-edge
    while((PINB & RX_PIN) != 0)
    {}

    //1.5 bits delay
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }

    for(z = 0 ; z < 8 ; z++)
    {
        retval >>= 1; //make space for the new bit
        retval |= (PINB & RX_PIN) << (8 - RX_PIN); //get the bit and store it
        Bit_Delay();
    }
    
    return retval;      
}

int main(void)
{
    CCP = 0xd8; //protection signature for clock registers (see datasheet)
    CLKPSR = 0x00; //set the clock prescaler to "div by 1"
    DDRB |= TX_PIN;
    PORTB |= TX_PIN; //idle high
        
    while (1) 
        Serial_TX_Char(Serial_RX_Char() ^ 0x20);
}

The target CPU is an Atmel ATTiny5 microcontroller, the code above uses up 94.1% of the FLASH memory! If you connect to the chip using a serial port at 9600 Baud, 8N1, you can type characters in and it returns them with bit 0x20 flipped (uppercase to lowercase and vice-versa).

This is not a serious project of course, I'm just experimenting to see how much functionality I could cram into this chip. I'm not going to bother with rewriting this in assembly, I seriously doubt I could do a better job than GCC's optimiser!

EDIT

@Frank asked about the IDE / compiler I'm using...

Microchip Studio (7.0.2542)

The "All Options" string that is passed to the compiler avr-gcc...

-x c -funsigned-char -funsigned-bitfields -DDEBUG  -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\include"  -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=attiny5 -B "C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\gcc\dev\attiny5" -c -std=gnu99 -MD -MP -MF "$(@:%.o=%.d)" -MT"$(@:%.o=%.d)" -MT"$(@:%.o=%.o)" 
2

There are 2 answers

7
AudioBubble On BEST ANSWER

I question the following assumption:

This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.

Looking at https://gcc.godbolt.org/z/sKdz3h8oP, it seems like the loops are actually being performed, however, for whatever reason each z++, when using a global volatile z goes from:

subi r28,lo8(-(1))
sbci r29,hi8(-(1))
ld r20,Y
subi r28,lo8((1))
sbci r29,hi8((1))
subi r20,lo8(-(1))
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
st Y,r20
subi r28,lo8((1))
sbci r29,hi8((1))

to:

lds r20,z
subi r20,lo8(-(1))
sts z,r20

You will need to recalibrate your 12, 18, and 5 constants to get your baud rate correct (since fewer instructions are executed in each loop), but the logic is there in the compiled version.

To be clear: This looks really weird to me, the local volatile version is clearly not being compiled correctly. I did find an old gcc bug along these lines: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33970, but it seems to not cover the local variable case.

0
emacs drives me nuts On

Can I optimise this any further,

Of course; code like this is extremely expensive — and fagile:

volatile char z = 12;
    
while(z)
{
    z++;
    z++;
    z++;
    z++;
    z -= 5;
}

It's expensive because your are asking for code bloat just to waste some cycles. And it's fragile because minimal changes to the code might change the timings. Apart from that it triggers stack frames because local volatile vars will live in the stack frame.

To make things worse, you are using volatile char z as a loop variable!

Why to use _delay_ms and friends

AVR-LibC provides delay routines like _delay_us and delay_ms in <util/delay.h>. Advantage is that:

  • A specified amount of time is wasted, which is passed as a parameter. (The routines might waste more real time than expected if interrupts are on.)

  • Code size is minimal due to inline assembly, compiler built-ins like __builtin_avr_delay_cycles and code folding.

  • No more need for magic numbers like 12 or 18 in your code.

The delay time must evaluate to a compile-time constant, and optimization must be turned on1. Canonical use case is to compute delay time using F_CPU, baud rate etc. Suppose we want to delay x * 1/BAUD seconds, which is x * 1000000 / BAUD µs.

So let's change Bit_Delay to the following code, where we add -D F_CPU=8000000 to the command line options so that the delay routines have it available:

#define BAUD 9600

__attribute__((always_inline))
static inline void Bit_Delay (double x)
{
    _delay_us (x * 1000000 / BAUD);
}

Then use it like Bit_Delay (1). Changing the 4 usages to that, the code size drops from 480 Bytes to 360 Bytes.

Also adjusting the wait for 1.5 bits to Bit_Delay (1.5) and fixing the loop variable to not be volatile, the code size drops to 180 Bytes.

Functions Serial_RX_Char and Serial_TX_Char are just called once statically, so the compiler can inline them provided we make them static. This reduces the code size further to 170 Bytes, 44 bytes of which are start-up code and vector table. Moreover, we do no more need stack frames (which were triggered by local volatile vars), and the function calls are inlined, which saves RAM. Not unimportant on a device like ATtiny5 with just 32 Bytes of RAM.

FYI, the (inlined) delay code is compiled as:

    ldi r20,lo8(208)
    ldi r21,hi8(208)
1:  subi r20,1
    sbci r21,0
    brne 1b
    nop

Magic 208 is basically F_CPU / BAUD / 4 folded by the compiler, where the division by 4 is because one turn of the delay loop takes 4 ticks.

Why not to use _delay_ms and friends

Busy-waiting is a waste of time and energy. For very short timing issues it might be in order, longer busy-wait may destroy timings of other parts of the code, because it blocks their execution. If possible, use timers + interrupts for that purpose; it saved energy (when sleep can be used when idling) and does not delay execution of unrelated code.


1So that code folding works as expected. Otherwise, the delay code will complain:

util/delay.h:112:3: warning: Compiler optimizations disabled; functions from <util/delay.h> won't work as designed