A dual-core processor has a separate L1 data cache for each core and a shared L2 cache. The L1 caches are kept coherent using the MSI coherence protocol. Each L1 cache is 4Kbytes in size and the L2 cache is 256Kbytes in size, and all caches have a 64-byte block size and are initially empty. The two cores need to initialize (zero out) a shared array that has 1024 words, where word size is 32 bits (4 bytes). This requires each core to perform a write to 512 words, and the programmer has chosen to use core 0 to initialize the even-numbered elements and core 1 to initialize odd-numbered elements using the following code (which will execute on both cores in parallel). In this code, register initially $4 contains the core number (0 for the one core and 1 for the other) and register $5 initially contains the address of the shared array.
ADDI $4,$4,$4 ; ADDI $4,$4,$4 ; $4 is now 0 on core 0 and 4 on core 1 ADD $5,$5,$4 ; On core 1, $5 now points to first odd-numbered element ADDI $6,$5,4096 ; Compute where the loop should end Loop: SW $0,0($5) ; Initialize an element ADDI $5,$5,8 ; Next element (8 b/c we do every other element) BNE $5,$6,Loop ; Loop back if not done with all of our elements
Assuming that the shared array entirely fits on one page, answer the following:
Describe the interleaving of accesses from these two threads that results in the highest number of cache misses. What is the total number of L1 cache misses for this interleaving?
If the array fits in 64 blocks, each cache block contains 16 elements of the array, and each thread accesses 8 elements from the first block, then 8 from the second block, etc. The worst-case scenario is that these accesses are completely interleaved – thread accesses element 0, then thread 1 accesses element 1, then thread 0 accesses element 2, etc. In this case, none of the accesses is a hit – after core 0 element 0, the write from core 1 invalidates that block from core 0’s cache, so the write to element 2 is again a miss (and it invalidates the block from core 1’s cache), etc. Overall, the number of cache misses in this worst-case scenario is 1024. If the array is not block-aligned, the worst-case scenario still has 1024 misses – for 63 blocks things happen exactly as described above, and for the first and last block we either have multiple elements in a block (in which case the two threads can invalidate each other’s blocks as described above) or only one element in a block (in which case the access to that element is a compulsory miss).