Problem Set Solutions/Advanced Caches/Problem20

We modified the code from Problem 3, by adding a non-binding pref(etch) instruction that sends its address to the cache but always completes in one cycle. If the pref instruction is a cache miss, the requested block is requested from memory, the cache line for it is reserved in the cache, and the block arrives into that line 49 cycles after the prefetch instruction was executed. The cache is non-blocking and can handle a miss under any number of prefetch misses, i.e. the cache can service other requests (including starting a block fetch for another miss) even when it is already servicing many misses.

Note that the processor still completes instructions one by one. A prefetch completes when it has sent a request to the cache, but a load completes when it got the value. Thus the cache can get a load miss while one or more prefetches are still fetching data, but it will not move on to other instructions on a load miss and, as a result, once a load miss occurs there will be no new misses (or for that matter any other instructions) until the load is completed.

add $2,$0,$0    ; Initialize $2 to zero
loop:   bnez $4,end ; Finish if $4 is zero
    pref 0($5)  ; Send prefetch to cache
    lw $3,0($5) ; Load array element (note: it’s a 4-byte word)
    add $2,$2,$3    ; Add to sum
    addi $5,$5,4    ; Point to next element (each element is 4 bytes)
addi $4,$4,-1  ; Decrement element count
    b loop      ; Go to next iteration
end:

Unfortunately, this code is broken – it issues a prefetch for data that will be needed in the very next cycle, so if the lw instruction is a miss, the prefetch reduces it’s the total latency of this miss by only 1 cycle. To prefetch data that will be needed in the future, the constant used in the address computation can be modified, e.g. pref 4($5) would prefetch for the next iteration.

If the prefetch constant is changed to 40, what is the number of cycles needed to execute each iteration (the very first, the one after that, etc.)?

Solution:

The prefetch now gets data for 10 iterations ahead. So the first eight iterations there is no benefit from prefetching and we have 7+50, 7, 7, 7, 7+50, 7, 7, 7 cycles for them (first and fifth iteration have full misses). The first prefetched data item is in the third block (third element in that block) but the whole block is prefetched. By the time we get to the third block (in iteration 9), the prefetch from iteration 1 has already brought the block into the cache. Thus we have no misses for that block (iterations 9 through 12 are 7 cycles each). When we get to iteration 13 (first to access the fourth block), we are accessing a block that was first prefetched in iteration 3, which was more than 49 cycles ago and there is again no miss for that block (7,7, 7, 7 for iterations 13 through 16). For iteration 17, the block was prefetched in iteration 7. There have been no misses since then, so there were 71 cycles between the prefetch and load, thus the data is already in the cache and there is no miss. In fact, there will be no more misses until the end of the program, so overall we have 7+50 cycles for iterations 1 and 5, and all other iterations take 7 cycles each.