Problem Set Solutions/Mulitprocessing/Problem13

A dual-core processor has a separate L1 data cache for each core and a shared L2 cache. The L1 caches are kept coherent using the MSI coherence protocol. Each L1 cache is 4Kbytes in size and the L2 cache is 256Kbytes in size, and all caches have a 64-byte block size and are initially empty. The two cores need to initialize (zero out) a shared array that has 1024 words, where word size is 32 bits (4 bytes). This requires each core to perform a write to 512 words, and the programmer has chosen to use core 0 to initialize the even-numbered elements and core 1 to initialize odd-numbered elements using the following code (which will execute on both cores in parallel). In this code, register initially \$4 contains the core number (0 for the one core and 1 for the other) and register \$5 initially contains the address of the shared array.

ADDI \$4,\$4,\$4   ;
ADDI \$4,\$4,\$4   ; \$4 is now 0 on core 0 and 4 on core 1
ADD \$5,\$5,\$4        ; On core 1, \$5 now points to first odd-numbered element
ADDI \$6,\$5,4096 ; Compute where the loop should end
Loop:
SW \$0,0(\$5)     ; Initialize an element
ADDI \$5,\$5,8        ; Next element (8 b/c we do every other element)
BNE \$5,\$6,Loop  ; Loop back if not done with all of our elements

Assuming that the shared array entirely fits on one page, answer the following:

Change the above code to ensure that the number of cache misses is always as low as possible, regardless of the order in which the accesses from the two cores are interleaved? Hint: you will need to change which elements will be initialized by which core.

Solution:

To initialize elements in each block, at least one of the cores must write to the block and suffer a compulsory miss. The best possible scenario is if only one core accesses each block and suffers only the one compulsory miss on it. This best-case scenario is easiest achieved if one core initializes the first half of the array and the other initializes the second half. If the array is block-aligned, this results in exactly 64 misses (32 on one core and 32 on another). If the array is not block aligned, ideally we would have one core initialize the first half of the array plus any elements from the second half that are in the same block, leaving the other thread to initialize elements from the second half that do not share a cache block with elements form the second array. The following code assumes that the array is block-aligned.

BEQZ \$4,SkipAdd
ADDI \$5,\$5,2048     ; On core 1, \$5 points to second half of the array
SkipAdd:
ADDI \$6,\$5, 2048        ; Compute where the loop should end
Loop:
SW \$0,0(\$5)     ; Initialize an element
ADDI \$5,\$5,4        ; Next element
BNE \$5, \$6, Loop        ; Loop back if not done with all of our elements

If the array is not block aligned, additional code is needed to find the block boundary that is closest to the middle of the array and split the array between the two cores accordingly.