I have an application written in C with inline assembly for a Xilinx Microblaze core. My inline assembly has a delay task. Function "_delay_loop_X_x" delays exactly 4 cycles per loop of processor. The input signal determines the number of loop to be made. Function "_NOPx" is to achieve more precision. The function works fine, but at the end of the signal it gives twice extra delay. I'm afraid I use registers incorrectly. Can someone please check my assembly code?
For Microblaze I use this documentation: https://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
Assembler code:
static __inline__ void _delay_loop_1_x( uint8_t) __attribute__((always_inline));
static __inline__ void _NOP1 (void) {__asm__ volatile ("nop \n\t" ); } //1 cycle
static __inline__ void _NOP2 (void) {__asm__ volatile ("beqi r12, 1f \n\t""1:\n\t" ::: "r12", "cc" ); } //2 cycle
static __inline__ void _NOP3 (void) {__asm__ volatile ("brk r12, r0 \n\t" ::: "r12", "cc" ); } //3 cycle
static __inline__ void /* exactly 4 cycles */
_delay_loop_1_x( uint8_t __n )
{ /* cycles per loop */
__asm__ volatile (
" addik r11, r0, 1 \n\t" /* 1 */
"1: rsub %[input], r11, %[input] \n\t" /* 1 */
" beqi %[input], 2f \n\t" /* 1 */
"2: bnei %[input], 1b \n\t" /* 1 */
: /* ----- */
: [input]"r" (__n) /* ----- */
: "r11", "cc" /* 4 */
);
}
static __inline__ void /* exactly 4 cycles/loop */
_delay_loop_2_x( uint16_t __n )
{ /* cycles per loop */
__asm__ volatile ( /* __n..one */
" addik r11, r0, 1 \n\t" /* 1 */
"1: rsub %[loops], r11, %[loops] \n\t" /* 1 */
" beqi %[loops], 2f \n\t" /* 1 */
"2: bnei %[loops], 1b \n\t" /* 1 */
: /* ----- */
: [loops]"r" (__n) /* ----- */
: "r11", "cc" /* 4 */
);
}
static __inline__ void
_delay_cycles(const double __ticks_d)
{
uint32_t __ticks = (uint32_t)(__ticks_d);
uint32_t __padding;
uint32_t __loops;
if( __ticks <= 3 ) {
__padding = __ticks;
} else if( __ticks <= 0x400 ) {
__ticks -= 1;
__loops = __ticks / 4;
__padding = __ticks % 4;
if( __loops != 0 )
_delay_loop_1_x( (uint8_t)__loops );
} else if( __ticks <= 0x40001 ) {
__ticks -= 2;
__loops = __ticks / 4;
__padding = __ticks % 4;
if( __loops != 0 )
_delay_loop_2_x( (uint16_t)__loops );
}
if( __padding == 1 ) _NOP1();
if( __padding == 2 ) _NOP2();
if( __padding == 3 ) _NOP3();
}
C code:
#define _delay_ns(__ns) _delay_cycles( (double)(F_CPU)*((double)__ns)/1.0e9 + 0.5 )
#define _delay_us(__us) _delay_cycles( (double)(F_CPU)*((double)__us)/1.0e6 + 0.5 )
#define _delay_ms(__ms) _delay_cycles( (double)(F_CPU)*((double)__ms)/1.0e3 + 0.5 )
#define BIT_DELAY_1 _delay_ns(2070)
#define BIT_DELAY_5 _delay_us(19)
#define BIT_DELAY_7 _delay_us(26)
#define RX_TX_DELAY _delay_us(78)
#define SHA204_SWI_FLAG_TX ((uint8_t) 0x88)
XGpio GpioPIN;
uint8_t swi_send_bytes(uint8_t count, uint8_t *buffer);
uint8_t swi_send_byte(uint8_t value);
int main()
{
init_platform();
XGpio_Initialize(&GpioPIN, GPIO_PIN_DEVICE_ID);
XGpio_SetDataDirection(&GpioPIN, PIN_CHANNEL, ~PIN);
(void) swi_send_byte(SHA204_SWI_FLAG_TX);
cleanup_platform();
return 0;
}
uint8_t swi_send_byte(uint8_t value)
{
return swi_send_bytes(1, &value);
}
uint8_t swi_send_bytes(uint8_t count, uint8_t *buffer)
{
uint8_t i, bit_mask;
RX_TX_DELAY;
for (i = 0; i < count; i++) {
for (bit_mask = 1; bit_mask > 0; bit_mask <<= 1) {
if (bit_mask & buffer[i]) {
XGpio_DiscreteClear(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_1;
XGpio_DiscreteWrite(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_7;
}
else {
XGpio_DiscreteClear(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_1;
XGpio_DiscreteWrite(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_1;
XGpio_DiscreteClear(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_1;
XGpio_DiscreteWrite(&GpioPIN, PIN_CHANNEL, PIN);
BIT_DELAY_5;
}
}
}
return 0;
}
My result: https://i.stack.imgur.com/zu1vV.jpg
Thinking about it more I don't really see how those loops should be Nx4 cycles.
Once the C part is over (prologue of function) and the code will start execute the ASM part, I see:
for n = 1 (and let's say compiler will use
r1
forinput
):The executed instructions (step by step) will go I guess like this:
Then any remaining instructions from C will follow (epilogue of function).
If each instruction is 1 cycle (there's lot of dependencies, so on high frequency modern CPU that would be very unlikely, but if microblaze is low frequency simple RISC architecture with no multi-stage pipeline, then it may work like that), then you have 4 cycles.
for n = 2
That's 7 instructions, not 8. For 8 you need to
bnei
to theaddik
to re-setr11
again to1
for the delay purposes (even if the value is already set inr11
).In any case this makes me wonder whether the timing of this CPU is really that simple (1 instruction = 1 cycle, even when branching), and why don't you simplify the loop to simple countdown (and use rather native 32b type for
input
):Then you have 2-cycles delay loop.
But the main code using the delays... has lot of C code, which will translate to machine code instructions too, and those needs some time to execute as well, so it's not clear what you are measuring, and why, and if you did take into account execution delay caused by those additional instructions.
UPDATE
About
ticks -= 1
crash.. no idea, doesn't make sense, so you have to go into debugger and find out the true reason on machine level.But the whole thing doesn't make much sense, as you try to split cycle delay argument
ticks_d
by C expressions, like that block ofif (padding == ...)
. And those conditional tests will take many more cycles then the actual padding value, so there's no point to call theNOPx();
to delay even further, as you are already delayed by several tens of cycles more.Also now I finally found that
double
Olaf mentioned, that alone (conversion to integer for the remaining math) of course incurs huge performance penalty, probably even in hundreds of cycles. So the wholevoid delay_cycles(const double ticks_d)
will actually delay many many more cycles than you expect, the ASM part will be negligible in the total time (except that inner loop eating about ~3 cycles per the argument value, that one may ramp up the total delay considerably if the initial argument was big enough).If you really need instruction-cycle accuracy (not that common, last time I did need this was at 8 bit computers around 1990), you have to write it without C in pure asm, and count also the preparation/logic instructions into the delay - written ideally in fixed-exec-time manner, so calling
delay_cycles
function will incur fixed overhead for anyticks_d
argument value.You can then count that fixed overhead, and start by doing
ticks -= <overhead>;
first to have exact delay later. And you must switch fromdouble
to integers, as the Microblaze CPU doesn't have FPU unit, so the floating values are emulated by integer math, just the initial conversion to intint ticks = (int)(ticks_d);
must cost quite some cycles, you should see that in debugger, if you have disassembly view and step into it by single instruction.