We modified the code from Problem 3 by adding a non-binding pref (prefetch) instruction that sends its address to the cache but always completes in one cycle. If the pref instruction misses in the cache, the block is requested from memory, a cache line is reserved for it, and the block arrives in that line 49 cycles after the prefetch instruction was executed. The cache is non-blocking and can handle a miss under any number of prefetch misses, i.e. it can service other requests (including starting a block fetch for another miss) even while it is already servicing many misses.
Note that the processor still completes instructions one by one. A prefetch completes when it has sent its request to the cache, but a load completes only when it has received its value. Thus the cache can receive a load miss while one or more prefetches are still fetching data, but the processor will not move on to other instructions on a load miss; as a result, once a load miss occurs there will be no new misses (or, for that matter, any other instructions) until the load completes.
      add  $2,$0,$0   ; Initialize $2 (the sum) to zero
loop: beqz $4,end     ; Finish if $4 is zero
      pref 0($5)      ; Send prefetch to cache
      lw   $3,0($5)   ; Load array element (note: it's a 4-byte word)
      add  $2,$2,$3   ; Add to sum
      addi $5,$5,4    ; Point to next element (each element is 4 bytes)
      addi $4,$4,-1   ; Decrement element count
      b    loop       ; Go to next iteration
end:
Unfortunately, this code is broken – it issues a prefetch for data that will be needed in the very next cycle, so if the lw instruction is a miss, the prefetch reduces the total latency of that miss by only 1 cycle. To prefetch data that will be needed further in the future, the constant used in the address computation can be modified, e.g. pref 4($5) would prefetch for the next iteration.
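The effect of the prefetch constant can be sketched with a bit of cycle arithmetic, assuming the stated 49-cycle fill latency and an all-hit iteration of 7 cycles (the 7 loop instructions at 1 cycle each). The helper name and the formula are ours, not from the problem; the result applies to the lw that touches the first word of a block (later words in the same block simply hit):

```python
LATENCY = 49   # block arrives 49 cycles after the request executes
ITER = 7       # assumed: 7 loop instructions, 1 cycle each, on an all-hit iteration
WORD = 4       # element size in bytes

def residual_penalty(pref_const):
    """Cycles a missing lw still stalls, given pref pref_const($5).

    The pref targeting a given word runs pref_const // WORD iterations
    before the lw that needs it, and the lw executes one cycle after
    that pref plus ITER cycles per intervening iteration.
    """
    iters_ahead = pref_const // WORD
    head_start = ITER * iters_ahead + 1   # cycles from the pref to the lw
    return max(0, LATENCY - head_start)

print(residual_penalty(0))   # broken code: penalty cut by only 1 cycle
print(residual_penalty(4))   # prefetch one iteration ahead
print(residual_penalty(8))   # prefetch two iterations ahead
```

With pref 0($5) the stall drops from 49 to 48 cycles, matching the "only 1 cycle" observation above; larger constants shave off 7 more cycles per iteration of head start.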
If the prefetch constant is changed to 8, what is the number of cycles needed to execute each iteration (the very first, the one after that, etc.)?
The first four iterations behave the same as in problem 18 (full miss, hit, hit, hit). For the fifth iteration, the load gets its data 7 cycles earlier than in 4.b (the prefetch was executed two iterations earlier), so the penalty is only 34 cycles. Thus, from the fifth iteration on we have a four-iteration pattern of 7+34 = 41 cycles, then 7, 7, and 7 cycles.
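The steady-state pattern can be checked with a small cycle-accounting model. This is a sketch under our own assumptions, not the problem's official solution: a 16-byte line (4 one-word elements), 1 cycle per instruction, a demand-miss lw that also waits 49 cycles for its block, and no evictions over the short run (the valid_at bookkeeping is ours). The cold-start first iteration depends on problem 18's exact miss timing, so only the pattern from iteration 2 onward is meaningful here:

```python
BLOCK_BYTES = 16   # assumed line size: 4 one-word elements per cache line
LATENCY = 49       # block arrives this many cycles after the request executes
PREF_DIST = 8      # the constant in pref PREF_DIST($5)
WORD = 4           # element size in bytes
N_ITERS = 12

cycle = 0
valid_at = {}      # block number -> cycle at which that line holds valid data

def block(addr):
    return addr // BLOCK_BYTES

per_iter = []
for i in range(N_ITERS):
    start = cycle
    addr = i * WORD
    cycle += 1                            # bnez/beqz: branch test
    cycle += 1                            # pref: always completes in 1 cycle
    pb = block(addr + PREF_DIST)
    if pb not in valid_at:                # non-binding prefetch starts a fill
        valid_at[pb] = cycle + LATENCY
    lb = block(addr)
    if lb not in valid_at:                # demand miss with no fetch in flight
        valid_at[lb] = cycle + 1 + LATENCY
    cycle = max(cycle + 1, valid_at[lb])  # lw stalls until the block is valid
    cycle += 4                            # add, addi, addi, b: 1 cycle each
    per_iter.append(cycle - start)

print(per_iter)   # steady state: 41, 7, 7, 7 repeating from the fifth iteration
```

The model reproduces the answer above: hit iterations take 7 cycles, and every fourth iteration from the fifth onward takes 7 + 34 = 41 cycles.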