Problem 3. 20 Points – Compiler Optimizations for ILP

We have a VLIW processor with 16 registers and the following units:

1) A branch unit (for branch and jump operations), with a one-cycle latency.

2) A load/store (LW/SW operations) unit, with a six-cycle latency.

3) Two arithmetic units for all other operations, with a two-cycle latency.

Each instruction specifies what each of these four units should be doing in each cycle.

The processor does not check for dependences, nor does it stall to avoid violating program
dependences; the program must be written so that all dependences are satisfied. For
example, when a load-containing instruction is executed, the next five instructions
cannot use the loaded value. Consider the following loop (each row represents one VLIW
instruction), which performs element-wise addition of two 60-element vectors (whose
addresses are in R0 and R1) into a third (whose address is in R2). The end address of the
result vector is in R3, and each element is a 32-bit integer:

[Figure: VLIW code for the original loop, one instruction per row]

Note how we increment the vector pointers while waiting for the loads to complete. Also
note that each operation reads its source registers in the first cycle of its execution, so
we can modify source registers in subsequent cycles. Each operation also writes its result
register in its last cycle, so we can read the old value of that register until then.
Overall, this code takes ten cycles per element, for a total of 600 cycles.
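The timing rules above can be sketched in a few lines of Python. This is only an illustration: the helper names (`earliest_use`, `total_cycles`) are made up here, and the placement of operations in cycles 1, 2, 8 and 10 is an assumption that happens to match the stated ten cycles per element; the actual table may arrange the operations differently.

```python
# Latencies taken from the unit list in the problem statement.
LATENCY = {"branch": 1, "load_store": 6, "alu": 2}


def earliest_use(issue_cycle, unit):
    """Return the first cycle in which a later instruction sees the result.

    An operation issued in cycle c writes its result in its last cycle,
    c + latency - 1, so the new value is readable from cycle c + latency on.
    """
    return issue_cycle + LATENCY[unit]


def total_cycles(cycles_per_iteration, iterations):
    """Total loop time when every iteration takes the same number of cycles."""
    return cycles_per_iteration * iterations


# Assumed placement for one iteration of the original loop (hypothetical).
first_load = 1
second_load = 2  # only one load/store unit, so the two loads cannot share a cycle
add = earliest_use(second_load, "load_store")  # cycle 8
store = earliest_use(add, "alu")               # cycle 10

print(earliest_use(first_load, "load_store"))  # 7: the next five instructions cannot use the value
print(store)                                   # 10
print(total_cycles(10, 60))                    # 600
```

Under these assumptions the store issues in cycle 10, which agrees with the ten-cycle iteration and the 600-cycle total given above.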

We will unroll the loop so that three old iterations become one new iteration, and then
have the compiler schedule the instructions in the new loop. What does the new code look
like (use only as many instruction slots in the table below as you need), and how many
cycles does the entire 60-iteration loop (20 iterations after unrolling) take now?
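As a reminder of what the transformation means at the source level, here is a minimal Python sketch of the loop before and after unrolling by three. The function names are hypothetical, and the sketch says nothing about how the operations should be scheduled into VLIW instructions; it only shows that the unrolled loop does the same work in one third as many iterations.

```python
def vector_add(a, b):
    """Original loop: one element per iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def vector_add_unrolled3(a, b):
    """Unrolled loop: three old iterations per new iteration."""
    n = len(a)
    if n % 3 != 0:
        # 60 is a multiple of 3, so no clean-up loop is needed for this problem.
        raise ValueError("length must be a multiple of 3")
    out = [0] * n
    for i in range(0, n, 3):
        # The three additions are independent of one another.
        out[i] = a[i] + b[i]
        out[i + 1] = a[i + 1] + b[i + 1]
        out[i + 2] = a[i + 2] + b[i + 2]
    return out


a = list(range(60))
b = list(range(100, 160))
print(vector_add(a, b) == vector_add_unrolled3(a, b))  # True
print(len(range(0, 60, 3)))                            # 20 iterations after unrolling
```

Because the three element additions in one unrolled iteration do not depend on each other, a scheduler is free to overlap their loads, adds, and stores subject to the unit counts and latencies listed above.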

Answer: ______ cycles

Your code for the unrolled loop:

[Table: blank instruction slots for the unrolled loop]