This project is intended to help you understand the performance of multiple-issue out-of-order processors.
Part 3: Performance impact of caches
Now we can change some parameters of the simulated machine to see how they affect its performance. This is done by editing the configuration file. The processors (cores) are specified by the “cpucore” parameter near the beginning of the file. In this case, the file specifies that the machine has 256 identical cores (numbered 0 through 255), and that each core is described in the [issueX] section. Going to the [issueX] section, we see that the simulated core has many parameters. Among them, we see that the clock frequency is set to 5GHz, and that the core is an out-of-order core (inorder set to false) that fetches and issues 4 instructions per cycle (fetchWidth, issueWidth), can retire 5 instructions per cycle (retireWidth), has a branch predictor described in the [BPredIssueX] section, fetches instructions from a structure called “IL1” described in the [IMemory] section (this is specified by the instrSource parameter), and reads/writes data through a structure called “DL1” described in the [DMemory] section.
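Pieced together from the parameters just named, the [issueX] section contains entries along these lines. This is only a sketch reconstructed from the description above, not a copy of the file; the exact key names, value spellings, and the many parameters not mentioned here should be taken from cmp.conf itself:

```
[issueX]
frequency   = 5e9             # 5GHz clock
inorder     = false           # out-of-order core
fetchWidth  = 4               # fetch 4 instructions per cycle
issueWidth  = 4               # issue 4 instructions per cycle
retireWidth = 5               # retire 5 instructions per cycle
bpred       = "BPredIssueX"   # branch predictor section
instrSource = "IL1 IMemory"   # instruction fetch structure
dataSource  = "DL1 DMemory"   # data reads/writes
```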
In this part of the project, we will be modifying the L1 data cache, so let’s take a look at the [DMemory] section. It says that the structure the processor gets data from is of type “smpcache” (a cache that can work in a multiprocessor, as we will see in later projects). This cache can store 32KBytes of data (size parameter), is 4-way set-associative (assoc parameter), has a 32-byte block/line size (bsize parameter), is a write-back cache (writePolicy), uses an LRU replacement policy, has two ports with a port occupancy of 2 cycles (so it can handle two accesses every two cycles), has a 2-cycle hit time, takes 2 cycles to detect a miss, and on a miss requests data from the L1L2D device described in the [L1L2DBus] section. That device is a data bus connecting the L1 and L2 caches, and the L2 cache itself is described in the [L2Cache] section.
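For orientation, a [DMemory] section matching that description might look like the following. This is a sketch assembled from the text above, not a copy of cmp.conf, so verify each key name and value against your own file:

```
[DMemory]
deviceType  = 'smpcache'        # SMP-capable cache
size        = 32768             # 32KB capacity
assoc       = 4                 # 4-way set-associative
bsize       = 32                # 32-byte blocks/lines
writePolicy = 'WB'              # write-back
replPolicy  = 'LRU'
numPorts    = 2
portOccp    = 2                 # each port busy 2 cycles per access
hitDelay    = 2                 # 2-cycle hit time
missDelay   = 2                 # 2 cycles to detect a miss
lowerLevel  = "L1L2DBus L1L2D"  # misses go to the L1-L2 data bus
```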
Now let’s change some cache parameters and see how they affect performance. Before we make any changes to the cmp.conf file, we should save the original so we can restore the default configuration later. In general, you should be very careful to ensure that you are simulating the configuration you intend: the values for one component (e.g. the L1 cache) can affect what happens in other components (e.g. the L2 cache), so you should restore the default parameters after completing this part (Part 3) of the project.
We see (from the report in Part 1) that the miss rate in the 32KB DL1 cache is low (<1%), so increasing this cache size won’t make much of a difference. Instead, let’s reduce its size to 8KB and re-simulate crafty (using the same command line as in Part 1).
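Assuming the size parameter is in bytes as sketched earlier, the size reduction is a one-line edit in [DMemory]:

```
[DMemory]
size = 8192   # was 32768 (32KB); now 8KB
```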
What is the miss rate of this 8KB 4-way set-associative DL1 cache? What is the overall speedup achieved by replacing this 8KB cache with a 32KB cache? Why does the simulation take almost twice as long to simulate execution with an 8KB cache as it does for the default 32KB cache?
Now reduce the associativity of this 8KB cache to 1 (direct-mapped). What is the miss rate of this cache? What is the overall speedup achieved by going from direct-mapped to a 4-way set-associative cache?
Because a direct-mapped cache should be faster, reduce port occupancy, hit time, and miss detection time from 2 cycles to 1 cycle for our 8KB, direct-mapped cache. What is the speedup of using this faster direct-mapped cache instead of the slower 4-way set-associative cache?
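Assuming the same parameter names as before, the fast direct-mapped configuration would change these [DMemory] entries (a sketch, to be checked against your cmp.conf):

```
[DMemory]
size      = 8192   # still 8KB
assoc     = 1      # direct-mapped
portOccp  = 1      # was 2
hitDelay  = 1      # was 2
missDelay = 1      # was 2
```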
Let’s keep the fast 8KB direct-mapped DL1 cache, but eliminate the L2 cache. To do this, connect the L1L2DBus directly to the MemoryBus (comment out the lowerLevel="L2Cache L2" line and uncomment the lowerLevel="MemoryBus MemoryBus" line in the [L1L2DBus] section). What is the speedup achieved by having the L2 cache?
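After that edit, the relevant part of the [L1L2DBus] section would look roughly like this (assuming `#` is the configuration file’s comment character; the surrounding lines come from your own cmp.conf):

```
[L1L2DBus]
#lowerLevel = "L2Cache L2"           # L2 removed from the hierarchy
lowerLevel = "MemoryBus MemoryBus"   # L1 misses now go straight to memory
```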
Submit the four report files in T-Square, but first rename them to sesc_crafty.Part3A, sesc_crafty.Part3B, sesc_crafty.Part3C, and sesc_crafty.Part3D.
Part 4: Changing the simulated cache
The cache we are simulating is a write-back cache, and it keeps a dirty bit for each line to avoid unnecessary write-backs. Now we will explore what happens when there is no dirty bit, so every block we replace from the cache requires a write-back. Unfortunately, the simulator has no configuration option to prevent use of dirty bits – it assumes that a write-back cache will always have and use a dirty bit. As a result, we will have to change the code of the simulator. The source files that implement the ‘smpcache’ (used for our L1 cache) are SMPCache.h and SMPCache.cpp in the sesc/src/libsmp/ directory. The state kept for each cache line is described in SMPCacheState.h (in the same directory). The function that checks whether the dirty bit is set is isDirty() in SMPCacheState.h, and it is used in SMPCache.cpp several times. Completely eliminating the dirty bit from the code is hard, but the easy way to achieve the same effect is to change the code to ignore the dirty bit and write back the replaced block regardless of the dirty bit’s value. Make this change and then:
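In SMPCache.cpp, the change amounts to relaxing the replacement-path condition wherever isDirty() guards a write-back. The sketch below shows the shape of the edit only; the real call sites differ by SESC version, and doWriteBack() is a hypothetical stand-in for whatever the write-back routine is called in your copy of the code:

```cpp
// Sketch only -- find the real call sites by searching SMPCache.cpp
// for isDirty(); doWriteBack() is a hypothetical stand-in name.

// Original: write back only if the replaced line is dirty
//   if (line->isDirty())
//       doWriteBack(line);

// Modified: ignore the dirty bit and always write the line back
//   if (true /* was: line->isDirty() */)
//       doWriteBack(line);
```

Leaving the original condition in a comment makes it easy to restore the dirty-bit behavior for the Part 4C comparison run.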
Simulate with the same configuration we used in Part 3B (default configuration, but with an 8KB DL1 data cache). What is the speedup achieved in this configuration by removing the dirty bit?
How does the total number of accesses to the L2 cache change in Part 4A when the dirty bit is not used?
Note: Because report.pl does not provide summary statistics on the L2 cache, you will have to directly examine the report file generated by SESC. This file begins with a copy of the configuration that was used, then reports how many events of each kind were observed in each part of the processor. Events in the DL1 cache of processor zero (the one running the application) are reported in lines that start with “P(0)_DL1:”. The number of blocks requested from the L2 cache is reported as lineFill, and the number of write-backs to the L2 cache is reported as writeBack. The total number of accesses to the L2 cache is, obviously, the sum of these two.
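A quick way to pull those two counters out of the report is with awk. The sketch below assumes each counter appears on its own line as “P(0)_DL1:name=value”; check the exact format in your own report file, and run the awk command against that file instead of the hypothetical excerpt generated here:

```shell
#!/bin/sh
# Sum lineFill and writeBack counters for processor 0's DL1 cache.
# Assumes one "P(0)_DL1:name=value" counter per line -- an assumption;
# verify against the real SESC report format.

# Hypothetical excerpt standing in for the (much larger) real report:
cat > sample_report.txt <<'EOF'
P(0)_DL1:readHit=500000
P(0)_DL1:lineFill=12000
P(0)_DL1:writeBack=3000
EOF

awk -F'[:=]' '
/^P\(0\)_DL1:/ {
    if ($2 == "lineFill")  fills  = $3
    if ($2 == "writeBack") wbacks = $3
}
END { print fills + wbacks }   # total L2 accesses
' sample_report.txt
```

With the sample excerpt above this prints 15000 (12000 + 3000); on a real report, substitute your report file’s name for sample_report.txt.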
Keeping the DL1 cache size at 8KB, change the L2 cache to reduce the number of ports (numPorts) to 1, the hit latency (hitDelay) to 18, and prevent pipelining of L2 cache hits (portOccp=18, same as the hit latency). What is the speedup achieved by removing the dirty bit in this configuration? Note: You will have to run two new simulations for this, one with the original SESC code and the other with your changed code that disables the dirty bit.
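In the [L2Cache] section, that amounts to something like the following (parameter names assumed to match those used for the L1 cache; verify against your cmp.conf):

```
[L2Cache]
numPorts = 1    # one port: a single access can start per cycle
hitDelay = 18
portOccp = 18   # equal to hitDelay, so hits cannot be pipelined
```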
Where does the dirty bit matter more – in the configuration used in part A) or the one used in part C)? Why does it matter more in that configuration?
Submit the three report files in T-Square, but first rename them to sesc_crafty.Part4A, sesc_crafty.Part4C-HD (has dirty bit), and sesc_crafty.Part4C-ND (no dirty bit). Also submit the SESC source files you modified to prevent use of the dirty bit in SMPCache. Only submit the file or files that you have modified.