We modified the code from Problem 3, by adding a non-binding pref(etch) instruction that sends its address to the cache but always completes in one cycle. If the pref instruction is a cache miss, the requested block is requested from memory, the cache line for it is reserved in the cache, and the block arrives into that line 49 cycles after the prefetch instruction was executed. The cache is non-blocking and can handle a miss under any number of prefetch misses, i.e. the cache can service other requests (including starting a block fetch for another miss) even when it is already servicing many misses.
Note that the processor still completes instructions one by one. A prefetch completes when it has sent a request to the cache, but a load completes when it got the value. Thus the cache can get a load miss while one or more prefetches are still fetching data, but it will not move on to other instructions on a load miss and, as a result, once a load miss occurs there will be no new misses (or for that matter any other instructions) until the load is completed.
add $2,$0,$0 ; Initialize $2 to zero loop: bnez $4,end ; Finish if $4 is zero pref 0($5) ; Send prefetch to cache lw $3,0($5) ; Load array element (note: it’s a 4-byte word) add $2,$2,$3 ; Add to sum addi $5,$5,4 ; Point to next element (each element is 4 bytes) addi $4,$4,-1 ; Decrement element count b loop ; Go to next iteration end:
Unfortunately, this code is broken – it issues a prefetch for data that will be needed in the very next cycle, so if the lw instruction is a miss, the prefetch reduces it’s the total latency of this miss by only 1 cycle. To prefetch data that will be needed in the future, the constant used in the address computation can be modified, e.g. pref 4($5) would prefetch for the next iteration.
How many cycles does it take to execute this broken code as-is (prefetch uses 0($5) as the address)?
Now we have seven instructions in the loop, so we execute a total of 3586 (1+7512+1) instructions. The prefetch is executed only one cycle before the load for the same value, so if the load misses in the cache the prefetch has already requested the data and it will arrive 49 cycles after the prefetch is executed. So instead of taking 51 cycle (1 + miss penalty) the load completes in only 49 cycles (so the penalty is only 48 cycles now) [penalty of 49 is also OK, assuming that another cycle is needed after the line arrives to actually read it. Thus we now need 9730 (3586+12848) cycles, which is more than in 3.a). Prefetching made things worse because prefetch instructions take cycles to execute and are not executed early enough to eliminate a large part of miss penalty.