VLIW - Instruction width performance increase

Would doubling the number of operations in a VLIW instruction word allow a processor to achieve double the performance, since it can execute twice as many operations in parallel?

Answer by alexanius:

The answer depends on the type of calculation. Let us say that we have only one ALU on our machine. Imagine we have code that computes the sum of an array:

for(int i = 0; i < len; i++)
{
  sum += arr[i];
}

The pseudo assembly will look like the following:

; tick 0:
    LD arr[i] -> %r0    ; load value from memory to register on ALU0
; tick 1:
    ADD sum, %r0 -> sum ; increment sum value                on ALU0

The loop body takes 2 ticks. If we double the number of ALUs and unroll the loop body, we get the following situation:

; tick 0:
    LD arr[i] -> %r0    ; load value from memory to register on ALU0
    LD arr[i+1] -> %r1  ; load value from memory to register on ALU1
; tick 1:
    ADD sum, %r0 -> sum ; increment sum value                on ALU0
; tick 2:
    ADD sum, %r1 -> sum ; increment sum value                on ALU0

Now we can see that two loop iterations take 3 ticks. The loads can be done in parallel, but the additions themselves cannot be parallelized because each one depends on the result of the previous iteration. Two iterations take 3 ticks instead of 4, a speedup of about 1.3x rather than 2x. So doubling the number of ALUs does not double the performance here.
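As a side note, a compiler (or the programmer) can sometimes break such a dependence chain by keeping independent partial sums and combining them after the loop. The following C sketch only illustrates that idea; the function name sum_array is just for the example:

int sum_array(const int *arr, int len)
{
  int sum0 = 0, sum1 = 0;
  int i;
  for(i = 0; i + 1 < len; i += 2)
  {
    sum0 += arr[i];     /* independent chain, can go to ALU0 */
    sum1 += arr[i + 1]; /* independent chain, can go to ALU1 */
  }
  if(i < len)
    sum0 += arr[i];     /* leftover element when len is odd */
  return sum0 + sum1;   /* combine the partial sums once */
}

With two independent accumulators the two additions per unrolled iteration no longer depend on each other, so the second ALU can actually be used. (For floating-point data this reassociation changes rounding, so compilers only do it when allowed to.)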

Now let's look at another example: the sum of two vectors:

for(int i = 0; i < len; i++)
{
  c[i] = a[i] + b[i];
}

Let us look at the pseudo-assembly:

; tick 0:
    LD a[i] -> %r0      ; load value a[i]     on ALU0
; tick 1:
    LD b[i] -> %r1      ; load value b[i]     on ALU0
; tick 2:
    ADD %r0, %r1 -> %r2 ; add values          on ALU0
; tick 3:
    ST c[i] <- %r2      ; store value to c[i] on ALU0

The loop body takes 4 ticks. What happens if we double the number of ALUs? In this case there are no dependencies on previous iterations, so we can unroll the loop body and get the following code:

; tick 0:
    LD a[i] -> %r0      ; load value a[i]        on ALU0
    LD b[i] -> %r1      ; load value b[i]        on ALU1
; tick 1:
    LD a[i+1] -> %r3    ; load value a[i+1]      on ALU0
    LD b[i+1] -> %r4    ; load value b[i+1]      on ALU1
; tick 2:
    ADD %r0, %r1 -> %r2 ; add values             on ALU0
    ADD %r3, %r4 -> %r5 ; add values             on ALU1
; tick 3:
    ST c[i] <- %r2      ; store value to c[i]    on ALU0
    ST c[i+1] <- %r5    ; store value to c[i+1]  on ALU1

We still have 4 ticks, but in those 4 ticks we complete 2 loop iterations. So we can say that doubling the number of ALUs doubled our performance.
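In C terms, the unrolled schedule above corresponds to something like the following sketch (the function name add_vectors is just for the example):

void add_vectors(const int *a, const int *b, int *c, int len)
{
  int i;
  for(i = 0; i + 1 < len; i += 2)
  {
    c[i] = a[i] + b[i];             /* iteration i,   can go to ALU0 */
    c[i + 1] = a[i + 1] + b[i + 1]; /* iteration i+1, can go to ALU1 */
  }
  if(i < len)
    c[i] = a[i] + b[i];             /* leftover element when len is odd */
}

Each unrolled iteration is independent of the other, which is exactly what lets a VLIW compiler pack the two operations into the same wide instruction.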

These simple examples only illustrate the idea that instruction-level parallelism depends on the particular algorithm, and simply doubling the ALUs may not lead to doubling the performance.

In more complex cases, VLIW systems have to rely on a complex optimizing compiler that performs the scheduling optimizations which non-VLIW systems implement in hardware. In some cases this works better, in others worse.