How to include and translate custom instructions/extensions in standard C/C++ code while keeping performance high


I'm developing a general purpose image processing core for FPGAs and ASICs. The idea is to interface a standard processor with it. One of the problems I have is how to "program" it. Let me explain: the core has an instruction decoder for my "custom" extensions. For instance:

vector_addition $vector[0], $vector[1], $vector[2]    // (i.e. v2 = v0+v1) 

and many more like that. This operation is sent by the processor over the bus to the core; the processor itself handles loops, non-vector operations, etc., like this:

for (i=0; i<15;i++)           // to be executed in the processor
     vector_add(v0, v1, v2)   // to be executed in my custom core

The program is written in C/C++. The core only needs the instruction itself, in machine code:

  1. opcode = vector_add = 0x12
  2. register_src_1 = v0 = 0x00
  3. register_src_2 = v1 = 0x01
  4. register_dst = v2 = 0x02

    machine code = opcode | v0 | v1 | v2 = 0x7606E600

(or whatever, just a concatenation of different fields to build the instruction in binary)

Once it is sent over the bus to the core, the core is able to request all data from memory with dedicated buses and to handle everything without using the processor. The big question is: how can I translate the previous instruction to its hexadecimal representation? (Sending it over the bus is not a problem.) Some options that come to mind are:

  • Run interpreted code (translate to machine code at runtime in the processor) --> very slow, even using some kind of inline macro
  • Compile the custom sections with an external custom compiler, load the binary from the external memory and move it to the core with some unique instruction --> hard to read/understand source code, poor SDK integration, too many sections if code is very segmented
  • JIT compilation --> too complex just for this?
  • Extending the compiler --> a nightmare!
  • A custom processor connected to the custom core to handle everything: loops, pointers, memory allocation, variables... --> too much work

The problem is about software/compilers, but for those with deep knowledge of this topic: this is a SoC in an FPGA, the main processor is a MicroBlaze, and the IP core employs AXI4 buses.

I hope I explained it correctly... Thanks in advance!


There are 3 answers

rodrigo

Couldn't you translate all your sections of code to machine code at the start of the program (just once), save them in binary format in blocks of memory, and then use those binaries when needed?

That's basically how the OpenGL shaders work, and I find that quite easy to manage.

The main drawback is the memory consumption, as you have in memory both the text and binary representations of the same scripts. I don't know if this is a problem for you. If it is, there are partial solutions, such as unloading the source texts once they are compiled.

edA-qa mort-ora-y

I'm not sure I entirely understand, but I think I've been faced with something similar before. Based on the comment to rodrigo's response it sounds like you have small instruction pieces scattered through your code. You also mention an external compiler is possible, just a pain. If you combine the external compiler with a C macro you can get something decent.

Consider this code:

for (i=0; i<15;i++)
     CORE_EXEC(vector_add(v0, v1, v2), ref1)

The CORE_EXEC macro will serve two purposes:

  1. You can use an external tool to scan your source files for these entries and compile the core code. This code will be linked to C (just produce a C file with binary bits) using the "ref1" name as a variable.
  2. In C you'll define the CORE_EXEC macro to pass the "ref1" string to the core for processing.

So stage 1 will produce a file of compiled binary core instructions; for example, the above might produce an entry like this:

const unsigned char cx_ref1[] = { 0x12, 0x00, 0x01, 0x02 };

And you might define CORE_EXEC like this:

#define CORE_EXEC( code, name ) send_core_exec( cx_##name )

Obviously you can choose the prefixes however you want, though in C++ you might wish to use a namespace instead.

In terms of toolchain you could produce one file for all your bits or one file per C++ file -- which might make dirty detection easier. Then you can simply include the generated files in your source code.

old_timer

Let's say I was going to modify an ARM core to add some custom instructions, and the operations I wanted to run were known at compile time (will get to runtime in a sec).

I would use assembly, for example:

.globl vecabc
vecabc:
   .word 0x7606E600 ;@ special instruction
   bx lr

or inline it with whatever the inline-assembly syntax is for your compiler. It gets harder if you need to use processor registers, for example where the C compiler fills in the registers in the inline assembly and the assembler then assembles those instructions. I prefer writing actual asm and just injecting the words into the instruction stream as above; the compiler only distinguishes some bytes as data and some as instructions, and the core will see them in the order written.

If you need to do things at runtime you can use self-modifying code; again I like to use asm as a trampoline. Build the instructions you want to run somewhere in RAM, say at address 0x20000000, then have a trampoline call it:

.globl tramp
tramp:
    bx r0 ;@ assuming you encoded a return in your instructions

call it with

tramp(0x20000000);

Another path, related to the one above, is to modify the assembler to add the new instructions, creating a syntax for them. Then you can use straight assembly language or inline assembly at will. You won't get the compiler to emit them without modifying the compiler too, which is another path to take after the assembler has been modified.