hpca Â»

```
0x110 li R2, 0 ; v = 0
0x114 li R3, 16 ; Loop bound for LoopI
0x118 li R4, 0 ; i = 0
LoopI:
0x11c beq R4, R3, EndLoopI ; Exit LoopI if i==16
0x120 li R5, 0 ; j = 0
LoopJ:
0x124 beq R5, R3, EndLoopJ ; Exit LoopJ if j==16
0x128 add R6, R5, R4 ; j + i
0x12c andi R6, R6, 3 ; (j+i) % 4
0x130 bne R6, R0, EndIf ; Skip if (j+i)%4 != 0
0x134 add R2, R2, R5 ; v += j
EndIf:
0x138 addi R5, R5, 1 ; j++
0x13c beq R0, R0, LoopJ ; Go back to LoopJ
EndLoopJ:
0x140 addi R4, R4, 1 ; i++
0x144 beq R0, R0, LoopI ; Go back to LoopI
EndLoopI:
```

For this code, determine the *exact* number of branch mispredictions that occur during its execution (show your work), if we use an always-taken predictor.

Instructor notes: For the given MIPS code, note that R0 (register 0) in the MIPS ISA always has a value of zero. Also note that MIPS instructions always start at word-aligned addresses, so the two least significant bits of the instructionâ€™s address are always zero. As a result, the two least significant bits are never used to index branch predictor tables. Instead, these bits are ignored and branch predictor table is indexed using the least significant bits of the instruction address that remain after the two least significant bits are removed.

For the code from Problem 1, determine the *exact* number of branch mispredictions that occur if we use a one-bit predictor with 8 entries. All predictor entries start out with zero values.

Mispredictions from branches 0x124 and 0x144?

Mispredictions from branch 0x130?

Mispredictions from branches 0x11c and 0x13c?

For the code from Problem 1, determine the *exact* number of branch mispredictions that occur if we use a local history predictor with 8 entries. Each entry has a 2-bit local history and four 2-bit counters. All history bits and 2-bit counters start out with zero values. For each 2-bit counter, use the 0-1-2-3/3-2-1-0 counting method from the lectures, not the 0-1-3/3-2-0 method from the textbook.

Consider an unpipelined processor. Assume that it has a 1-ns clock cycle and that it uses 4 cycles for ALU operations and 5 cycles for branches and 4 cycles for memory operations. Assume that the relative frequencies of these operations are 50%, 35%, and 15% respectively.

Suppose that due to clock skew and set up, pipelining the processor adds 0.15ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?