Problem 3. 20 Points – Compiler Optimizations for ILP
We have a VLIW processor with 16 registers and the following units:
1) A branch unit (for branch and jump operations), with a one-cycle latency.
2) A load/store (LW/SW operations) unit, with a six-cycle latency.
3) Two arithmetic units for all other operations, with a two-cycle latency.
Each instruction specifies what each of these four units should be doing in each cycle.
The processor does not check for dependences nor does it stall to avoid violating program
dependences – the program must be written so that all dependences are satisfied. For
example, when a load-containing instruction is executed, the next five instructions
cannot use that value. Consider the following loop (each row represents one VLIW
instruction), which performs element-wise addition of two 60-element vectors (whose
addresses are in R0 and R1) into a third (whose address is in R2), where the end address
of the result vector is in R3 and each element is a 32-bit integer:
Note how we increment the vector pointers while waiting for loads to complete. Also note
how each operation reads its source registers in the first cycle of its execution, so we
can modify source registers in subsequent cycles. Each operation also writes its result
register in the last cycle, so we can read the old value of the register until then. Overall,
this code takes ten cycles per element, for a total of 600 cycles.
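The timing rules and the cycle count above can be checked with a small sketch. The
names LATENCY, earliest_use and total_cycles are illustrative assumptions, not part of
the problem statement; the latencies are the ones listed for the three unit types.

```python
# Latencies from the problem statement, in cycles (dictionary keys are hypothetical labels).
LATENCY = {"branch": 1, "mem": 6, "alu": 2}

def earliest_use(issue_cycle, unit):
    """First cycle in which a later instruction sees the new result.

    An operation issued in cycle c with latency L writes its result in its last
    cycle (c + L - 1), so instructions issued in cycles c+1 .. c+L-1 still read
    the old register value.
    """
    return issue_cycle + LATENCY[unit]

def total_cycles(cycles_per_iteration, iterations):
    """Total loop time when every iteration takes the same number of cycles."""
    return cycles_per_iteration * iterations

# A load issued in cycle 0: the next five instructions (cycles 1..5) cannot use it.
print(earliest_use(0, "mem"))   # 6
# An arithmetic operation issued in cycle 0: only the next instruction cannot use it.
print(earliest_use(0, "alu"))   # 2
# Original loop: ten cycles per element, 60 elements.
print(total_cycles(10, 60))     # 600
```

The same total_cycles helper applies to an unrolled loop once its per-iteration cycle
count is known.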
We now unroll the loop so that three original iterations form one new iteration, and then
have the compiler schedule the instructions in that new loop. What does the new code look
like (use only as many instruction slots in the table below as you need), and how many
cycles does the entire loop (60 original iterations, 20 iterations after unrolling) take now?
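As a reminder of what unrolling by three means at the source level, here is a minimal
sketch in Python rather than in VLIW instructions. The function names are hypothetical,
and the sketch says nothing about how the operations should be scheduled; it only shows
that the unrolled loop does the same work in one third as many iterations.

```python
def vector_add_rolled(a, b):
    """One element per iteration: 60 iterations for 60-element vectors."""
    out = [0] * len(a)
    i = 0
    while i < len(a):
        out[i] = a[i] + b[i]
        i += 1
    return out

def vector_add_unrolled3(a, b):
    """Three elements per iteration: 20 iterations for 60-element vectors.

    Assumes the length is a multiple of 3, which holds for 60 elements.
    """
    assert len(a) % 3 == 0
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Three independent copies of the original loop body.
        out[i] = a[i] + b[i]
        out[i + 1] = a[i + 1] + b[i + 1]
        out[i + 2] = a[i + 2] + b[i + 2]
        i += 3  # a single pointer/index update and branch serve three elements
    return out

a = list(range(60))
b = list(range(100, 160))
print(vector_add_rolled(a, b) == vector_add_unrolled3(a, b))  # True
```

The three body copies are independent of one another, which is what gives the scheduler
freedom to overlap their loads and additions.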
Answer: ______ cycles
Your code for the unrolled loop: