We modified the code from Problem 3 by adding a non-binding pref (prefetch) instruction that sends its address to the cache but always completes in one cycle. If the pref instruction misses in the cache, a fetch of the requested block from memory begins, a cache line is reserved for it, and the block arrives in that line 49 cycles after the prefetch instruction was executed. The cache is non-blocking and can handle a miss under any number of prefetch misses, i.e. the cache can service other requests (including starting a block fetch for another miss) even while it is already servicing many misses.
Note that the processor still completes instructions one at a time. A prefetch completes once it has sent its request to the cache, but a load completes only when it has received the value. Thus the cache can take a load miss while one or more prefetches are still fetching data, but the processor does not move on to other instructions during a load miss; as a result, once a load miss occurs there will be no new misses (or, for that matter, any other instructions) until the load completes.
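The timing rules above can be sketched as a toy Python model (the class, the method names, and the 16-byte block size are my assumptions, not given in the problem). Each method takes the cycle at which the instruction issues and returns the first free cycle after it completes:

```python
MISS_PENALTY = 50   # cycles a load miss costs on top of its normal 1 cycle
PREF_LATENCY = 49   # the block arrives this many cycles after the pref executes
BLOCK_SIZE = 16     # assumption: 4-byte elements, four per block

class Cache:
    """Toy timing model of the non-blocking cache described above."""

    def __init__(self):
        # block address -> cycle at which that block is (or will be) resident
        self.arrival = {}

    def _block(self, addr):
        return addr - addr % BLOCK_SIZE

    def pref(self, addr, cycle):
        b = self._block(addr)
        if b not in self.arrival:           # start the fetch if not already underway
            self.arrival[b] = cycle + PREF_LATENCY
        return cycle + 1                    # a pref always completes in one cycle

    def lw(self, addr, cycle):
        b = self._block(addr)
        if b in self.arrival:
            # hit, or miss under an outstanding prefetch: wait for the block
            return max(cycle, self.arrival[b]) + 1
        # cold miss with no prefetch in flight: full penalty
        self.arrival[b] = cycle + MISS_PENALTY
        return cycle + MISS_PENALTY + 1

c = Cache()
c.pref(0x100, 0)        # prefetch issues at cycle 0
done = c.lw(0x100, 8)   # load to the same block, issued 8 cycles later
print(done - 8 - 1)     # stall cycles beyond a normal 1-cycle load: 41
```

Note how a load to a block with an outstanding prefetch simply waits for the arrival cycle, while a cold miss pays the full penalty; this is the behavior the cycle counts below rely on.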
        add  $2,$0,$0   ; Initialize $2 (the sum) to zero
loop:   beqz $4,end     ; Finish if $4 (the element count) is zero
        pref 0($5)      ; Send prefetch to cache
        lw   $3,0($5)   ; Load array element (note: it's a 4-byte word)
        add  $2,$2,$3   ; Add to sum
        addi $5,$5,4    ; Point to next element (each element is 4 bytes)
        addi $4,$4,-1   ; Decrement element count
        b    loop       ; Go to next iteration
end:
Unfortunately, this code is broken – it issues a prefetch for data that will be needed in the very next cycle, so if the lw instruction misses, the prefetch reduces the total latency of the miss by only 1 cycle. To prefetch data that will be needed further in the future, the constant used in the address computation can be modified; e.g., pref 4($5) would prefetch for the next iteration.
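As a back-of-the-envelope sketch (my own extrapolation, not part of the problem): with a prefetch constant of 4n bytes, the pref that first touches a block executes n iterations before the load that misses on it, i.e. about 7n+1 cycles earlier (7-cycle iterations, with the pref one slot ahead of the lw in the loop body). Ignoring interactions between overlapping misses, the leftover stall is therefore at most:

```python
ITER = 7        # cycles per miss-free loop iteration
PREF_LAT = 49   # the block arrives this long after the pref executes

def residual_stall(n):
    """Upper bound on the steady-state load stall with `pref 4n($5)`,
    i.e. prefetching n elements ahead (a rough model, ignoring miss overlap)."""
    lead = ITER * n + 1     # cycles between the covering pref and the stalled lw
    return max(0, PREF_LAT - lead)

print(residual_stall(1))   # pref 4($5):  41 remaining stall cycles
print(residual_stall(7))   # pref 28($5): 0 -- latency fully hidden in this model
```

This suggests the constant would have to prefetch roughly seven iterations ahead before the 49-cycle fetch latency is completely overlapped with useful work.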
If the prefetch constant is changed to 4, what is the number of cycles needed to execute each iteration (the very first, the one after that, etc.)?
The first iteration executes in 7+50 cycles (with the full miss penalty) because no prefetch was issued for the first block [-2 points for getting this wrong]. In the next three iterations the load is a cache hit (it accesses the same block that was fetched for the first iteration), so each executes in 7 cycles. The fourth iteration issues a prefetch for the fifth. The load in the fifth iteration misses, but with only a 41-cycle penalty (7 cycles less than in 4.a, because the prefetch was executed 7 cycles earlier), so the fifth iteration executes in 7+41 cycles. The sixth, seventh, and eighth iterations take 7 cycles each (no misses), and then the pattern repeats: 7+41, then 7, 7, 7, then 7+41, 7, 7, 7, and so on.
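As a cross-check on this accounting, here is a small simulation of the loop under the stated assumptions (16-byte blocks holding four 4-byte elements, which matches the three-hit runs; 1 cycle per instruction; a load to an in-flight block waits for its arrival). The names are mine. It starts at iteration 2, with the first block already resident after the cold first iteration, and reproduces the 7, 7, 7, 7+41 pattern:

```python
ELEM = 4        # element size in bytes
BLOCK = 16      # assumed block size: four elements per block
PREF_LAT = 49   # block arrives this long after the pref executes
MISS = 50       # full penalty for a load miss with no prefetch in flight

def iteration_cycles(n_iters):
    """Per-iteration cycle counts, starting from iteration 2
    (block 0 is already resident after the cold first iteration)."""
    arrival = {0: 0}        # block address -> cycle the block is resident
    t = 0                   # current cycle
    counts = []
    addr = ELEM             # $5 at iteration 2 points to element index 1
    for _ in range(n_iters):
        start = t
        t += 1                              # beqz
        pb = (addr + 4) - (addr + 4) % BLOCK
        if pb not in arrival:               # pref 4($5): start fetch if needed
            arrival[pb] = t + PREF_LAT
        t += 1
        lb = addr - addr % BLOCK            # lw 0($5)
        if lb in arrival:
            t = max(t, arrival[lb]) + 1     # hit, or wait for the in-flight block
        else:
            arrival[lb] = t + MISS
            t += MISS + 1                   # cold miss: full penalty
        t += 4                              # add, addi, addi, b
        counts.append(t - start)
        addr += ELEM
    return counts

print(iteration_cycles(8))   # -> [7, 7, 7, 48, 7, 7, 7, 48]
```

Iterations 2-4 hit (7 cycles each), iteration 5 stalls for 41 extra cycles (48 total), and the pattern then repeats every four iterations, as described above.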