# Problem Set Solutions/Mulitprocessing/Problem11

A dual-core processor has a separate L1 data cache for each core and a shared L2 cache. The L1 caches are kept coherent using the MSI coherence protocol. Each L1 cache is 4Kbytes in size and the L2 cache is 256Kbytes in size, and all caches have a 64-byte block size and are initially empty. The two cores need to initialize (zero out) a shared array that has 1024 words, where word size is 32 bits (4 bytes). This requires each core to perform a write to 512 words, and the programmer has chosen to use core 0 to initialize the even-numbered elements and core 1 to initialize odd-numbered elements using the following code (which will execute on both cores in parallel). In this code, register initially \$4 contains the core number (0 for the one core and 1 for the other) and register \$5 initially contains the address of the shared array.

``````ADDI \$4,\$4,\$4   ;
ADDI \$4,\$4,\$4   ; \$4 is now 0 on core 0 and 4 on core 1
ADD \$5,\$5,\$4        ; On core 1, \$5 now points to first odd-numbered element
ADDI \$6,\$5,4096 ; Compute where the loop should end
Loop:
SW \$0,0(\$5)     ; Initialize an element
ADDI \$5,\$5,8        ; Next element (8 b/c we do every other element)
BNE \$5,\$6,Loop  ; Loop back if not done with all of our elements
``````

Assuming that the shared array entirely fits on one page, answer the following:

Note that the number of coherence misses that will occur in L1 caches depends on how accesses from the two cores are interleaved in time. One way of interleaving these accesses is to do all accesses from core 0, and then all accesses from core 1. What is the total number of L1 misses for this interleaving?

Solution:

The entire array has 4096 bytes, so it covers 64 or 65 blocks (depending on whether the array begins right at the beginning of a block). Let’s assume that the array begins at cache block boundary, in which case the array fits exactly in 64 blocks. Because core 0 accesses every other word, it will access every block that stores the array so it will suffer 64 cache misses. Then core 1 also accesses every one of the 64 blocks, so it also suffers 64 misses. The total number of misses is 128.

If the array does not begin at a block boundary, its elements would be kept in 65 blocks, and the number of misses would either be 130 (each core accesses all the blocks) or 129 (if only one element ends up in a block by itself then onl one core accesses it so it has 65 misses and the other core has 64).