Sample Project 2

This project is intended to help you learn more about multi-core execution. You will submit a report for this project. To complete this project, you should know how to set up the simulator and run a simulation. You should also know how to compile a simple application using the cross-compiler. Project 0 has instructions on how to accomplish these tasks.

Each part of this project assignment specifies what should be included in your report and which additional files to submit in T-Square. Do not archive the report files when you submit them – each file submitted in T-Square should be uploaded separately, with the name specified in this assignment.

As explained in the course rules, this is an individual project: no collaboration with other students or anyone else is allowed.

Part 1: Running a parallel application

Set up the simulator as explained in Project 1, and make sure that you are using the default configuration file (not the one modified for Project 1). This time, we will be compiling and running a parallel application called lu from the Splash-2 benchmark suite (which is often used to evaluate shared-memory parallel machines).

First, make an lu directory in your ~/sim directory, then copy the source code of the application there:

 cd ~/sim

 mkdir lu

 cd lu

 cp /CS6290/lu.c .

Now compile this application:

 mips-unknown-linux-gnu-gcc -O3 -g -static -fno-delayed-branch -fno-optimize-sibling-calls -msplit-addresses -mabi=32 -march=mips4 -o lu.mipseb lu.c -lm -lpthread

Note that we are using the same options as we did in Project 1, except that we are now telling the compiler to use the math library (-lm) and the pthread library (-lpthread) when linking the application.

Now that we have a MIPS executable file for this application, we can run some simulations. The lu application has two main parameters, -n <size> and -p <nthreads>, where <size> determines the problem size and <nthreads> specifies how many threads to use. This application performs an LU decomposition of a <size>-by-<size> matrix using <nthreads> parallel threads.
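For example, a 2-thread run on a 128-by-128 matrix might look like the line below. This is only a sketch: the sesc.opt binary, its location, and the -c option for selecting a configuration file are assumptions based on the Project 0/1 setup, so adjust the paths and the configuration file name to match your own installation:

 ~/sim/sesc/sesc.opt -c <default-config-file> lu.mipseb -n128 -p2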

To complete this part of the project, run the lu application in the simulator using 1, 2, 4, 8, and 16 processors (cores), using <size> of 128, then answer the following:

  A. What is the parallel speedup and parallel efficiency with 2, 4, 8, and then with 16 processors?

    Note: Parallel speedup is the speedup of parallel execution over single-core execution with the same input size. Parallel efficiency is the parallel speedup divided by the number of threads used – ideally, the speedup would be equal to the number of threads used, so the efficiency would be 1. A worked example follows after these questions.

  B. Note how the parallel efficiency in A) gets lower as we use more processors. What do these results mean? Why is this happening?

  C. When we use two processors (-p 2) instead of one (-p 1), the IPC achieved by processor 0 (the first processor listed in the report) is a bit lower. Why?
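For example (with purely hypothetical numbers, just to illustrate the arithmetic): if the 1-thread simulation reports 80 ms of simulated execution time and the 4-thread simulation of the same problem reports 25 ms, the parallel speedup with 4 processors is 80/25 = 3.2 and the parallel efficiency is 3.2/4 = 0.8.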

Submit simulation report files in T-Square (but rename them to sesc_lu.mipseb.Part1-p1, sesc_lu.mipseb.Part1-p2, sesc_lu.mipseb.Part1-p4, sesc_lu.mipseb.Part1-p8, and sesc_lu.mipseb.Part1-p16).

Part 2: Using a larger problem

Now repeat the simulations from Part 1, but this time using a <size> of 256; you will only need simulations with 1, 4, and 16 threads.

  A. What is the parallel speedup and parallel efficiency with 4 and with 16 processors when this larger problem size is used?

  B. Compared to your results for the same number of processors from Part 1.A), why is parallel efficiency now better?

  C. Look at the report for the 16-thread execution and note how processor 0 executed more than its fair share of all instructions. Why is this so?

Submit simulation report files in T-Square (but rename them to sesc_lu.mipseb.Part2-p1, sesc_lu.mipseb.Part2-p4, and sesc_lu.mipseb.Part2-p16).

Part 3: Cache miss behavior

In this part of Project 2, we will be focusing on the number of read misses in the DL1 (Data L1) cache. In the report file generated by the simulator (sesc_lu.mipseb.something), the number of cache read misses that occur in each DL1 cache (one per processor core) is reported in lines that begin with “P(0)_DL1:readMiss=” for Processor 0, “P(1)_DL1:readMiss=” for Processor 1, etc.
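One quick way to total these counters across all cores is to let grep and awk do the summing. The line below is only a sketch: it assumes the report lines have exactly the “P(n)_DL1:readMiss=value” form described above, so adjust the field handling if your report lines carry anything after the value:

 grep -E '^P\([0-9]+\)_DL1:readMiss=' sesc_lu.mipseb.Part1-p4 | awk -F= '{sum += $2} END {print sum}'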

  A. What is the total number of DL1 read misses that occur in simulations from Part 1 (<size> is 128) with 1, 4, and 16 threads/processors used? Why does the number of these misses decrease as we go from one to four threads/processors? Why does the number of these misses increase as we go from four to sixteen threads/processors?

  B. What is the total number of DL1 read misses that occur in simulations from Part 2 (<size> is 256) with 1, 4, and 16 threads/processors used? Why does the number of these misses now decrease both as we go from one to four and as we go from four to sixteen threads/processors? The misses reported as “readMiss” by the simulator are those that occur when a load instruction does not find data in the DL1 cache, so the DL1 cache sends a request to the L2 cache to get that block. Another type of cache miss occurs when the data is not found in the DL1 cache, but the data has already been requested from the L2 cache by an earlier miss (a miss under a miss). The number of such misses is reported as “readHalfMiss”.

  C. Consider simulations with 1, 4, and 16 threads from Part 1. In each of these three simulations, what is the average number of (half) misses that occur under another (full) miss? (See the example command after these questions.)

  D. Consider simulations with 1, 4, and 16 threads from Part 2. In each of these three simulations, what is the average number of (half) misses that occur under another (full) miss?

  E. Explain the difference between your answers for C) and for D).
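For C) and D), the average number of half-misses per full miss is the total readHalfMiss count divided by the total readMiss count. For example (a sketch, under the same assumptions about the report-line format as the grep/awk example above):

 grep -E '^P\([0-9]+\)_DL1:read(Half)?Miss=' sesc_lu.mipseb.Part1-p4 | awk -F= '/readHalfMiss/ {h += $2; next} {m += $2} END {print h/m}'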

Part 4: Identifying accesses to shared data

Your task in this part of the project is to first break down DL1 cache accesses by static instruction – for each static load instruction (a load instruction in the program), you should determine how many reads are caused by its dynamic instances (individual executions of that static instruction).
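For example, a minimal sketch of this per-instruction bookkeeping is shown below. The function names are hypothetical: you will need to find the right place in the simulator’s DL1 read path to call countRead() from, and obtain the load’s PC there:

 #include <cstdint>
 #include <cstdio>
 #include <map>

 // Count of DL1 read accesses per static load instruction, keyed by PC.
 static std::map<uint32_t, uint64_t> readsPerPC;

 // Hypothetical hook: call wherever the simulator performs a DL1 read.
 void countRead(uint32_t loadPC) {
   readsPerPC[loadPC]++;
 }

 // Call once at the end of the simulation to dump the per-PC counts.
 void dumpReadCounts() {
   for (std::map<uint32_t, uint64_t>::const_iterator it = readsPerPC.begin();
        it != readsPerPC.end(); ++it)
     printf("PC 0x%08x: %llu reads\n", it->first, (unsigned long long)it->second);
 }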

You can then use the /CS6290/mipsrt/cross-tools/bin/mips-unknown-linux-gnu-addr2line tool (which is part of the cross-compilation toolchain) to find out which line of code each static instruction comes from. If the instruction comes from the C library (not from lu.c), the line number will be zero, and for those instructions you can use the objdump cross-tool (/CS6290/mipsrt/cross-tools/bin/mips-unknown-linux-gnu-objdump) to find out which function the instruction is in.
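For example, to map a (hypothetical) instruction address 0x00400f30 back to a line of lu.c, and to disassemble the executable so you can find the enclosing function for C-library instructions:

 /CS6290/mipsrt/cross-tools/bin/mips-unknown-linux-gnu-addr2line -e lu.mipseb 0x00400f30

 /CS6290/mipsrt/cross-tools/bin/mips-unknown-linux-gnu-objdump -d lu.mipseb > lu.dis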

Once you have modified the simulator to count read events for each load instruction as described above, re-compile the lu application with fewer optimizations (cross-compile as described before, but with -O1 instead of -O3; this is needed because advanced optimizations can prevent addr2line from finding the correct line of code for a particular static instruction). Then re-run the same simulations used in Part 3.A) and, for each of the three simulations, answer the following:

  A. What are the top five static instructions (specify the instruction address, the corresponding line of source code, and underline the part of the source-code line that specifies the accessed data item) according to the total number of cache reads caused by each instruction? How many cache reads (not just read misses) are caused by each of these five instructions?

Next, you’ll need to determine, for each static instruction, what percentage of its dynamic occurrences (executions of the instruction) are 1) accessing a data block that has not yet been written by any processor (let’s call this category “new-data”), 2) accessing a data block whose most recent write came from the same processor (let’s call this category “own-data”), and 3) accessing a data block whose latest write came from another processor (let’s call this category “get-data”). Note that this classification depends only on the source of the data (where it was produced) for each read, without distinguishing read hits from read misses (or half-misses). Similarly, the classification considers only the last writer of the block, regardless of which caches the block is found in at the time of the read access. Once you have modified the simulator to count these three categories of events for each load instruction as described above (a sketch of this bookkeeping follows after these questions), re-run the same simulations used in Part 3.A) and for each of the three simulations answer the following:

  B. Show this breakdown (into new-data, own-data, and get-data) for each of the top-five static instructions you listed in Part 4.A).

  C. Are there static load instructions (not just among the top five) that never read data written by other processors (100% of their reads are new-data or own-data, and 0% are get-data)? How does the number of reads that come from these instructions change when the number of threads is increased (from 1 to 4 to 16)?

  D. What do your results in B) and C) imply about the parallel performance of this application?
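The sketch below outlines the last-writer tracking that the new-data/own-data/get-data classification requires. Again, the hook names and types are hypothetical: noteWrite() must see every processor’s writes, classifyRead() must be called for every read access (hit, miss, or half-miss), and blockAddr is assumed to be the cache-block-aligned address:

 #include <cstdint>
 #include <map>

 // Last processor to write each cache block, keyed by block address.
 static std::map<uint32_t, int> lastWriter;

 // Per-PC counters for the three read categories.
 struct ReadStats { uint64_t newData, ownData, getData; };
 static std::map<uint32_t, ReadStats> statsPerPC;

 // Hypothetical hook: call on every write, with the writing processor's id.
 void noteWrite(uint32_t blockAddr, int cpu) {
   lastWriter[blockAddr] = cpu;
 }

 // Hypothetical hook: call on every read access, hit or miss.
 void classifyRead(uint32_t loadPC, uint32_t blockAddr, int cpu) {
   ReadStats &s = statsPerPC[loadPC]; // zero-initialized on first use
   std::map<uint32_t, int>::const_iterator it = lastWriter.find(blockAddr);
   if (it == lastWriter.end())
     s.newData++;  // block never written by any processor: new-data
   else if (it->second == cpu)
     s.ownData++;  // most recent write came from this processor: own-data
   else
     s.getData++;  // most recent write came from another processor: get-data
 }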

Submit simulation report files in T-Square (renamed to sesc_lu.mipseb.Part4-p1, sesc_lu.mipseb.Part4-p4, and sesc_lu.mipseb.Part4-p16). Also submit in T-Square your lu.mipseb MIPS executable for Part 4 (the one created with -O1) and a sescchanges.patch file that should be created as follows:

Unpack the simulator (from /CS6290/sesc.tar.bz2) into another directory (e.g. ~/sim/sesc.old) and then get the differences between your modified source files and the old ones with:

diff -C 2 -r ~/sim/sesc/src/ ~/sim/sesc.old/src > sescchanges.patch

This creates a file that contains all the changes you made to the simulator’s source code, and allows us to re-create your changes by applying this patch to the original code of the simulator.

Note: if you want to debug the simulator, you can uncomment the “DEBUG=1” option in the Make.defs file (in your runsim directory), do a “make clean”, and then “make” again. When you are done debugging, remember to re-build the simulator without debugging (“make clean” and then “make” again), because the debug build is much slower.

Note: You can modify the simulator to produce the breakdown needed for B) and then use the same data to answer A) – the total number of reads for each static instruction is equal to the sum of the three categories of reads for that instruction. However, it may be easier to first do A) and get your modified simulator code working before you make the further code modifications that are needed for B).

Hint: the simulator is a C++ application, and you can use the C++ standard template library to easily implement the data structures (std::map and its cousins should be especially useful) that you will need to track the last writer of each block and to count different kinds of accesses for each static instruction. At the end of the simulation, you can either add your new data to the report file or simply print the data to the screen. Note that many different but correct implementations are possible – you can use any data structures and algorithms you want, but it is up to you to ensure that the results you obtain are correct.

Warning: Part 4 of this project will take much more time than the entire Project 1 did. You should start early to avoid running out of time.