I attempted to write a sort algorithm to reorder instructions for a dual issue processor (Cell SPU). One way to obtain dual issue processing an instruction should not depend on the instruction that precedes it (another involves separate pipelines, but I'm focused on instructions in the same pipeline). I understand this would be too much for the compiler to do and I didn't find what I needed when searching. This can be done by hand in most cases but a sorting algorithm should ensure the lowest "sequence count" (number or dependent instructions that follow each-other).
My question is has this or something similar been done before? Is there an optimized approach?
Simple example pseudo-code halving instruction time (inputs: i1, i2, i3):
v1 = i1 ^ i2; - #single-issued
v2 = v1 | i2; \ #v2,v3 dual-issued
v3 = i1 & i3; / #v2,v3 dual-issued
v4 = v3 & i2; - #single-issued
can be written as:
v1 = i1 ^ i2; \ #v1,v3 dual-issued
v3 = i1 & i3; / #v1,v3 dual-issued
v2 = v1 | i2; \ #v2,v4 dual-issued
v4 = v3 & i2; / #v2,v4 dual-issued
Here is a python implementation I created that recursively reorders the instructions to achieve the lowest "sequence count".
reorder.py
http://pastebin.com/dt8eWy3H
sample t8-1.h
http://pastebin.com/w0DYg8ff
While I can't speak specifically for the Cell, code scheduling is ABSOLUTELY something the compiler should be doing for you.
Compilers will re-order instructions, pad in NOPS as required, and do everything it can to provide a good code schedule for you. Normally, I'd tell you to look at the "mtune" parameters for your compiler (they allow you to tell your compiler exactly what your processor looks like), but since you're coding for the Cell it should already know what to do (but do check the compiler manual just to be sure).
A brief glance at the GCC compiler for the SPU here shows options such as:
As a programmer, it is your job to provide enough "ILP" in your code to get good scheduling. Try to avoid branches, avoid putting long latency operations on the critical path, etc., and you should be fine. Analyze the objdump of your critical loops to verify the code is being scheduled as you desire. The compiler is very smart, but it may require a little coaxing.