Would doubling the number of operations in a VLIW instruction word allow a processor to achieve double the performance, since it can execute twice as many operations in parallel?
The answer depends on the type of calculation. Let us say that we have only one ALU in our machine. Imagine we have code that computes the sum of an array:
for(int i = 0; i < len; i++)
{
    sum += arr[i];
}
The pseudo assembly will look like the following:
; tick 0:
LD arr[i] -> %r0 ; load value from memory to register on ALU0
; tick 1:
ADD sum, %r0 -> sum ; increment sum value on ALU0
The loop body takes 2 ticks. If we double the number of ALUs and unroll the loop body, we get the following situation:
; tick 0:
LD arr[i] -> %r0 ; load value from memory to register on ALU0
LD arr[i+1] -> %r1 ; load value from memory to register on ALU1
; tick 1:
ADD sum, %r0 -> sum ; increment sum value on ALU0
; tick 2:
ADD sum, %r1 -> sum ; increment sum value on ALU0
Now we can see that the loop body takes 3 ticks. It is possible to do the loads in parallel, but the calculation itself cannot be parallelized because each addition depends on the result of the previous one. So doubling the number of ALUs does not double the performance here.
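At the C level the unrolled loop looks roughly like the sketch below (a minimal illustration, assuming int elements and ignoring the leftover iteration when len is odd); it shows why the two loads are independent while the two additions still form a chain through sum:

for(int i = 0; i + 1 < len; i += 2)
{
    int x = arr[i];     /* independent load, can go to ALU0 */
    int y = arr[i + 1]; /* independent load, can go to ALU1 */
    sum += x;           /* must wait for the previous value of sum */
    sum += y;           /* must wait for the addition above */
}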
Now let's look at another example: the sum of two vectors.
for(int i = 0; i < len; i++)
{
    c[i] = a[i] + b[i];
}
Let us look at the pseudo assembly:
; tick 0:
LD a[i] -> %r0 ; load value a[i] on ALU0
; tick 1:
LD b[i] -> %r1 ; load value b[i] on ALU0
; tick 2:
ADD %r0, %r1 -> %r2 ; add values on ALU0
; tick 3:
ST c[i] <- %r2 ; store value to c[i] on ALU0
The loop body takes 4 ticks. What happens if we double the number of ALUs? In this case there are no dependencies on previous iterations, so we can unroll the loop body and get the following code:
; tick 0:
LD a[i] -> %r0 ; load value a[i] on ALU0
LD b[i] -> %r1 ; load value b[i] on ALU1
; tick 1:
LD a[i+1] -> %r2 ; load value a[i+1] on ALU0
LD b[i+1] -> %r3 ; load value b[i+1] on ALU1
; tick 2:
ADD %r0, %r1 -> %r4 ; add a[i] + b[i] on ALU0
ADD %r2, %r3 -> %r5 ; add a[i+1] + b[i+1] on ALU1
; tick 3:
ST c[i] <- %r4 ; store result to c[i] on ALU0
ST c[i+1] <- %r5 ; store result to c[i+1] on ALU1
We still use 4 ticks, but in those 4 ticks we complete 2 loop iterations. So in this case doubling the number of ALUs did double our performance.
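For reference, the unrolled loop that corresponds to this schedule looks roughly like this at the C level (a minimal sketch, again ignoring the leftover iteration when len is odd):

for(int i = 0; i + 1 < len; i += 2)
{
    c[i]     = a[i]     + b[i];     /* can be scheduled on ALU0 */
    c[i + 1] = a[i + 1] + b[i + 1]; /* independent of the line above, can go to ALU1 */
}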
These simple examples only illustrate the idea that instruction-level parallelism depends on the particular algorithm, and that simply doubling the number of ALUs may not lead to doubling the performance.
In more complex cases VLIW systems have to rely on a complex optimizing compiler that performs the scheduling work non-VLIW systems implement in hardware. In some cases this works better, in others worse.