Problem Set Solutions/Advanced Caches/Problem21

We modified the code from Problem 3, by adding a non-binding pref(etch) instruction that sends its address to the cache but always completes in one cycle. If the pref instruction is a cache miss, the requested block is requested from memory, the cache line for it is reserved in the cache, and the block arrives into that line 49 cycles after the prefetch instruction was executed. The cache is non-blocking and can handle a miss under any number of prefetch misses, i.e. the cache can service other requests (including starting a block fetch for another miss) even when it is already servicing many misses.

Note that the processor still completes instructions one by one. A prefetch completes when it has sent a request to the cache, but a load completes when it got the value. Thus the cache can get a load miss while one or more prefetches are still fetching data, but it will not move on to other instructions on a load miss and, as a result, once a load miss occurs there will be no new misses (or for that matter any other instructions) until the load is completed.

add $2,$0,$0    ; Initialize $2 to zero
loop:   bnez $4,end ; Finish if $4 is zero
    pref 0($5)  ; Send prefetch to cache
    lw $3,0($5) ; Load array element (note: it’s a 4-byte word)
    add $2,$2,$3    ; Add to sum
    addi $5,$5,4    ; Point to next element (each element is 4 bytes)
addi $4,$4,-1  ; Decrement element count
    b loop      ; Go to next iteration
end:

Unfortunately, this code is broken – it issues a prefetch for data that will be needed in the very next cycle, so if the lw instruction is a miss, the prefetch reduces it’s the total latency of this miss by only 1 cycle. To prefetch data that will be needed in the future, the constant used in the address computation can be modified, e.g. pref 4($5) would prefetch for the next iteration.

What is the optimal value (the one that minimizes the total execution time of this code) for the constant in the pref instruction in this code? Explain your answer.

Solution:

The miss in iteration 1 can only be helped if we use 0 for the constant, which is clearly bad (we save only a few cycles from each miss, and we have one miss for each block). If we prefetch between 4 and 7 iterations ahead, we will prevent a miss in iteration 5 – the very first prefetch will start getting the second block, and because of the miss in iteration 1 there is enough time for the block to arrive in time for iteration five. If there is no miss for iteration 5, however, iterations 2 through 7 each take only 7 cycles. To avoid a miss in iteration 9 (the first one to access the third block) we need the prefetch to occur at least 49 cycles earlier, and with 7-cycle iterations that means 7 iterations earlier. The only way we can avoid the miss in iteration 5 and all subsequent misses is if we prefetch exactly 7 iterations ahead, so the optimum value for the constant is 28 (7*4). With a value lower than that there will be misses throughout the loop because the prefetch does not occur early enough to prevent them, and with a value higher than that we will have a miss in iteration 5.