Sample Project Description

This project is intended to help you understand multi- and many-core performance. To complete this project, you should know how to setup the simulator, cross-compile an application, run a simulation, and examine results. Projects 1 and 2 contain instructions on how to accomplish these tasks. This project will be a lot easier to complete successfully if you have successfully completed Project 2. It is to your advantage to start early. Each part of this project assignment specifies what should be included in your report to get points for that part of the project. As explained in the course rules, this is an individual project: no collaboration with other students or anyone else is allowed.

What to submit in T-Square:

You will submit a report (as a PDF file) and additional files (see each question) for this project.

Helpful hints and clarifications:

Read the entire assignment carefully before you begin working on any part of it. Knowing what you will need to do for other parts of the assignment may (and probably will) help you plan your work and organize your modifications to the simulator’s code.

In this assignment, we are only looking at data L1 caches, e.g. if the question is about cache misses, it refers to cache misses in data L1 caches. Furthermore, “cache miss” refers to “real” cache misses. What the simulator reports as “half-miss” can be considered to be a cache hit for the purposes of reporting cache misses. Unless otherwise specified, always use the default configuration file (the one that came with the simulator), not the modified configuration file from some previous project or another part of this project. If a part of this project calls for a modification to the configuration file, you should start with the default configuration and just modify the parameters you are asked to modify.

Part 1 [60 points]: Improving many-core performance

In this part of the project, you will change the code in game.c (and only game.c) to try and improve the parallel performance of the gol application. You will be submitting the game.c file for grading, and it should correctly compile (and run) after it is compiled together with (unmodified) main.c and game.h you were given, using the (unmodified) Makefile you were given.

The goal of your changes will be to:

  1. Compile cleanly. Your code should compile cleanly (without any errors or warnings), using the main.c, game.h, and the Makefile you were given, for both native and SESC compilation. No points will be earned if your code compiles cleanly, but you will receive zero points for Part 1 if your code does not compile cleanly – to earn any points in Part 1, you (and we) should be able to run your code.

  2. Same code for all platforms. You are allowed to use G99 and GNU extensions in GCC 4.1.1 and 4.2.2 (the code is compiled with –std:gnu99), including GCC-specific attributes and other arcane features, as long as your code still compiles cleanly and runs correctly both for native x86 and for simulated MIPS execution. However, you are not allowed to use any pre-processor directives that result in compiling different code on different ISAs, version of the system, versions of GCC compiler, etc. For example, you cannot use inline assembler code, assign variables to specific registers, use library or system calls that are not supported either natively on our servers (cc-allison, cc-bonnett, cc-irwin) or in SESC, etc. As with A), no points will be earned for following this, but a substantial percentage (depending on the severity of the infraction) of the points you earn in Part 1 of this project will be deducted if you fail to follow this.

  3. Same code for all values of nthreads, gens, xsize, ysize, etc. The same code should execute regardless of the number of threads, number of generations, size of the problem, etc. You should not have different code for different values of one (or more) of these parameters and then select which code to execute based on the value of the parameter. As with A) and B), no points will be earned for following this rule, but you will lose points (how many depends on the magnitude of the infraction) for not following this rule.

  4. Retain correctness. The code you were given works correctly (but not too efficiently) with different input files, problem sizes, and numbers of threads (including non-power-of-two values for the number of threads). The code you produce should produce exactly the same output file as would the code you were given.

    We will test correctness of your code not only using inputs we provided to you, but also using additional input files, numbers of threads, etc. As with A), B), and C), no points will be earned by having your code work correctly, but we will perform 200 test runs with different input files and parameters (number of threads and generations) and you will lose 1 point for each test run in which your code fails to produce correct results.

    Extra credit:

    If you find a mistake in the game.c you were given, report that mistake by sending an e-mail to the professor together with a short explanation of how this error can be manifested (i.e. how it can produce incorrect outputs). For each legitimate problem in our game.c, the first student to report it will receive 2 bonus points. To discourage students from sending in too many illegitimate error reports (in hope that one of them might turn out to be a real error), an error report will be considered illegitimate if it does not provide a reasonable explanation of the error. You will lose 0.5 points for each illegitimate error you report.

  5. Improve performance in single-core execution [20 points].

    In your report, specify the SimTime (execution time on the simulated machine) that your code achieves in a single-threaded run (-n 1) using with 10 generations (the first simulation of the three that are performed when you do “make sim_one”, see Makefile for the exact parameters), with the default simulator configuration. The 20 points allocated to Part 1.E. will be assigned according to where the performance of you code stands between what you were given and our own solution (which, for obvious reasons, will not be given to you). We will compute the difference in execution time between your solution and the code you were given, do the same for our solution, divide your solution’s difference by our solution’s difference, and multiply that result with 20 to get the number of points for Part 1.E. If the given code’s execution time is 10ms, your code’s execution time is 4ms, and our code’s execution time is 3ms, the differences are 6ms for your code and 7ms for our code, and you would get 20*(6/7)=17 points here. If this points computation results in negative points (i.e. if your code is slower than the given code), these points will be deducted from any points you earn in Part 1.F.

  6. Improve performance of many-core execution [40 points]

    In your report, specify the SimTime (execution time on the simulated machine) that your code achieves in a 16-threaded run using with 10 generations (the first simulation of the three that are performed when you do “make sim_many”, see Makefile for the exact parameters), with the default simulator configuration. The number of points you will be given for Part 1.E will be computed by simply multiplying the parallel efficiency of your code by 44. E.g. if your 1-threaded execution takes 10ms and your 16-threaded execution takes 1ms, the parallel speedup is 10 and the efficiency is 10/16, so you would get 44*(10/16)=27.5 points. Note that perfect parallel efficiency would give you 44 points here, but perfect parallel efficiency is very hard to achieve – this way we allow you to get all 40 points even if your parallel efficiency is only about 0.91. Warning: We will use the game.c you submit to check the SimTime you report here. You will receive zero points for Part3.E if the SimTime you provide in your report is fraudulent.

You can use your simulator code from Part 2 to figure out if you can improve performance by reducing the number of misses. You are allowed to use the fact that the cache block size in the default configuration of the simulator is 32 bytes, but if you do this you should define a constant:

define cbsize 32

and then use cbsize in your code. Your code should work correctly and your cache-block-size-related optimizations should work for any value of cbsize (1..maxint).

You are allowed to change anything you want in game.c, as long as the changes satisfy the conditions specified above. For example, you may need to introduce new variables into the targs structure (to pass additional values to threads), change how the cell arrays are organized or allocated, change how the threads are created (as long as you only create up to <nthreads>-1 additional threads).

Keep in mind that, if your code produces wrong outputs in a native execution, it is likely to produce wrong outputs in a simulated execution. Therefore, after each code change you can test your code with native execution before trying simulated execution (which is much more time-consuming). Also keep in mind that most of the performance problems in parallel execution become more severe when more cores are used. Therefore, it makes sense to first work on getting good results with fewer threads (e.g. 2 or 4), as the simulation speed is much better when simulating fewer threads.

Warning: Start early to avoid running out of time, you will need to modify the code of the gol application substantially and test it (possibly with many simulation and/or native runs to debug your game.c code).

Part 2 [40 points]: Finding the number of required coherence misses

In Project 2, you have identified, for each dynamic load instruction, where the data block it is accessing was last written, and you have also counted such dynamic instances for each static load instruction. In this project, we will make this work a bit more useful for figuring out the performance of parallel applications.

Of all the “get_data” accesses from Project 2, the ones that can cause cache misses are those that occur right after the block is written by another processor. E.g. if processor A has written to a block, and processors B and C subsequently do 10 load accesses each to that block, the very first access on B and on C will be a miss (if the block was in B’s or C’s cache, it was invalidated when A wrote to it). Your task for this Part 3 is to separate what we counted in “get_data” (reading a block previously written by another processor) category from Project 2 into “need_miss” (first read of this block on this core since it was written), and “had_miss” (we read this block already since it was last written).

  1. In the given code for Part 1, what are the top ten static load instructions according to the number of “need_miss” accesses?
  2. For each top-ten static instruction from A), does this instruction come from the library (if so, which function) or from game.c (if so, which part of the code)?
  3. Repeat A) for your submitted game.c code from Part 1. Is there a significant difference? If yes, explain what did you do to cause the change. If no, explain why you couldn’t (or didn’t try to) reduce the number of these potential misses.
  4. For each of the top-ten static instructions from A), what is the number of actual cache misses that this instruction has in the cache and what percent of these misses is accounted for by “need_miss” accesses (static instruction’s “need_miss” count divided by its actual total miss count)?
  5. Are there any “had_miss” occurrences where that access is still a miss in the cache. Explain why this is happening.
  6. Generate and submit a patch file (sescchanges.patch) just like you did for Part 4.E in Project 2.