cs259 »

CS259 Lesson5

01 l Bug Categories

Welcome to the 5th lesson in our debugging course titled Reproducing Failures. In the past lessons we have seen a number of techniques to systematically hand down a failure cause by following back dependencies and applying the scientific method to choose between various possible origins until we find the defect. So far, however, we have assumed that the program fails in the lab and that somehow we would be able to actually access these earlier states or at least run appropriate experiments where we would be able to gather additional. Contrast this with a program failing in the field where all we know is that the program failed plus possibly a few hints from the execution, but all-in-all only a spottier record compared to what you get in the lab. This is where reproduction comes into play. You have to reproduce the failing run from the field in the lab such that you can actually do the debugging. Why do you need to do that? We need to be able to observe the run, because once we have reproduced a failure locally, we can observe and experiment at will. This, of course, is invaluable for isolating the defect. Second, we need to be able to check the fix. Only if we can reproduce a bug can we be sure that we have actually fixed it, namely, if the failure no longer occurs. Reproducing a bug can be way harder than fixing the bug. In programmer's jargon, bug's fall into four categories. First, there is the Bohr bug from Bohr's model of the atom. This is a repeatable bug that many manifests reliably under a possibly unknown but well-defined set of conditions. Next category is a Heisenbug from Heisenburg's uncertainty principle in quantum physics. This is a bug that disappears or alters its behavior when one attempts to probe or isolate it. The next one is the Mandelbug, coming from the Mandelbrot set. This is a bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even nondeterministic. And last but not least there is the Schrödinbug, which is MIT jargon coming from Schrödinger's cat thought experiment in quantum physics. You know--the situation in which you have a cat in a box, which may be dead or which may be alive, but you don't know until you open the box. This is a bug in a program that doesn't manifest until someone who reads the source or who uses the program in an unusual way notices that the program never should have worked, at which point the program promptly stops working for everybody until it is fixed. All these bug categories refer to various levels of difficulty as it comes to reproducing the bugs. Here is an example of a bug that was hard to reproduce that I once encountered. I had a C program that crashed all the time. In an extremely simplified version this is what it looks like. If I ran this program normally, it would crash. The assertion would fail. If I ran it in a debugger, however, it worked just fine.

02 q Classify The Bug

Quiz: What kind of bug is this? Is this a Bohr bug, a Heisenbug, a Mandelbug, or a Schroedinbug? Over to you.

02 s Classify The Bug

Now for the answer--Is this a Bohr Bug, which is repeatable under under a well-defined set of conditions? Well, if you include a debugger in the conditions, it could be a Bohr Bug. Normally, we don't do that. Heisenbugs change as the program is being observed, and therefore, this is a perfect instance of a Heisenbug. Mandelbugs have very complex causes--not the case over here. A Schroedinbug appears as soon as somebody looks at this program, and it probably stops working--Not the case over here. I would love to see a Schroedinbug in practice. It's a very interesting theoretical concept. So, how does this bug come to be? What we have here is a C-specialty. This is an initialized variable. When in a C program, we read an uninitialized variable as in here. Then the behavior of the program is undefined according to the semantics of C and also C+ does follow. Since the behavior is undefined, crashing was a perfect option. When executed normally x would simply take whatever random value was on the stack. Therefore, the assertion would fail. However, the debugger would always set the entire stack to zero before execution, and therefore x would take a zero value. Therefore, the assertion would hold. You can even have such distinction between running normally and being run on a debugger on purpose. Some malware actively checks if it's being run in a debugger and then automatically turns itself off. So, if you try to diagnose on malware, the malware may be specifically set up to prevent all sorts of probing. If you suspect a Heisenbug, think what the debugger does differently than the normal execution, and observe the program execution by at least two independent means, say a debugger and logging output or two different debuggers or whatever you have in order to find out what's going on.

03 l The Many Facets of Input

Let's now go for a systematic reproduction process. First we need to reproduce the environment in which the bug occurs. Second we need to reproduce the steps required to reproduce the problem. Both of these can be abstracted to input. All your program does is dependent on its input. Therefore, if you can reproduce the input, you can also reproduce the execution. With that we already solved the problem. Programs are nothing but mathematical functions. If they have the same input, they will reproduce the same output, right? Why is there actually a problem of reproducing? That's the difference between theory and practice. In theory, there is no difference. In practice, there is, at least that's the theory. Indeed, in practice the concept of what makes input to a program is very variable. We distinguish a number of typical inputs, each coming up with its own issues. We do have static data, user interaction, time of day or general time, randomness, that is randomness on purpose. We have the operating environment. We have schedules, in particular for parallel programs. We have physical influences, and we do have debugging tools. All of these constitute input to the program, and each of these brings up its own issues as it comes to reproducing failures. Let's start with regular static data. In a batch-oriented program, these are the inputs that is the files of documents your program processes. Static data is the easiest to reproduce, because it can be handled as an individual entity outside of your program. If your program operates on files, all you need to do is to ship the data files. If it works on a database, then use tables instead. This includes all data that is under the control of the user. If it's under your control, you probably already know it. So, you care about the variable stuff. For one thing, these are documents under the user's control as well as configuration files, in which user's store their preferences or the settings for they system at hand. Obviously, if you have all the documents and the configuration files that your program operates upon, then it's fairly easy to reproduce the exact problem. However, there's a small catch with that--just a tiny, tiny catch. This tiny, tiny catch can become a big issue. Let me illustrate this with an example. Here again we have our remove html markup function, and here comes an input as one could find it in, say, a user's document. We can run the whole thing with this input and very easily reproduce the failure. In this case, the failing assertion.

05 p Reproducing User Interaction

In the good old days programs would read their input from data files only, which made reproduction an easy task. Modern programs use complex user interfaces, and these make it hard to observe and control the user's input. The standard approach to reproduce user interaction is to use a capture replay tool. A capture replay tool acts like a layer that interposes itself between the input, as provided by the user, and the actual program. These tools can operate in two modes. During recording mode all input coming from the user is passed down directly to the program as is all output from the program. So, the user interacts normally with the program as always. During the capture all input from the user, that is mouse clicks, keystrokes, and similar events, are recorded in the log file. During replay the program executes under control of the capture replay tool, getting its input from the previously recorded events and thereby ignoring regular user input. That is, all the events which are previously recorded and are passed onto the program just as if they were input by the user, and the program produces its normal output. Technically spoken, such tools realize capture replay by intercepting calls to library or system functions, which would normally provide user input. Let's demo this on an example. We have a program which takes commands. From a command, it processed them, Here is the function input command, which is supposed to accept a command. Process is a function, which then processes the command as entered. This is how such an input function would be implemented in Python. The function raw input outputs the prompt on the console and accepts a line that will then be returned by the function raw input and therefore input command. The idea we are having in here, however, does not support interaction with programs. So we simulate the whole thing by reading commands from an array instead. Here we have a list of commands. With every invocation of input command, the next command is returned from the array. We do so by increasing the index, command index, by one every time input command is called. The first time it returns open. The second time it returns save--The third time it returns quit. Here's the function that actually handles the command. All it does at this point is to print out the command, and if the command starts with q , such as the quit command, then the global resume variable is set to false, meaning that a y loop over here simply exits. Now we have to move the main function at the end such that it gets executed after all the functions are refined. Let's see whether this works properly--It does. Here we have simply the three commands printed out, and with the quit command the loop simply exits. Here's how a capture replay tool works. It replaces the original input command function by a new function, which wraps around the original input command function, calls it, also returns its original value. On top of that it also saves the commands in a log. In our example here, we're simply going to use this list here wherever our new command is to be appended. This is a simple programming exercise, and I'll leave it over to you.

05 s Reproducing User Interaction

Okay. Welcome back. Let's go and implement this thing. Since we're addressing a global variable, we need to declare it. All we do is we take the command that's being input in here and append it to our list. To see whether everything has been recorded properly, we print out saved commands at the end of the execution. Let's see how this works. Here again we do have our three commands as they're being processed. This is the recorded commands, namely: open, save, and quit. If you're running this outside of the web page-- that is, on a command line-- then you can actually uncomment this line and interact with the program itself. [The Infinite Abyss]

06 s Record and Replay

Again this is fairly easy to realize, because we already have a template up here. So again, we have an index to the list of saved commands, which we call saved command index. So, at this moment we are in replay mode. While we are in replay mode the original input command function should no longer be called. All of the commands should now come from the saved commands in here. Let's see how this works. We get the same set of commands now being read from the saved commands variable. You read a program that is outside of the webpage one would now go and store the saved commands content permanently, say in a file for recording and for replaying. One would read this from a file. For programs that take text based commands, recording and replaying input is fairly easy to realize. However, as it comes to reproducing graphical user interaction it becomes hard to decide what of events to record and replay. Suppose you have a graphical user interface for a web browser. You have a captured replay tool that interposes itself between the user and the program and stores all events coming from the user in an event log such that they can be later replayed. The problem is: What is the abstraction level you're going to chose for your events? You could, for instance, go and record mouse clicks to gather with the coordinates of the mouse pointer. This is fairly easy to record. So, for instance, you could go and click here on search at the position, say 100x and 300y. Or, you could enter some text while the cursor is in the position 200 and 50. Click here on the file menu or here down on this roll-down menu. Of course, all of these then go into the event log. The problem is that for replaying, you need to have the exact same position of every element at the very same position on the screen. Now suppose your screen resolution changes. or the font size changes. Then some of the elements will still be more or less in their original positions, but others won't. We knew that the mouse click, which previously activated a particular button on the screen, will no longer work Likewise, if the position of some elements change, then as well, the recorded events will no longer work. Also, timing can become an issue. Suppose you recorded a click on this drop-down menu. Half a second later, when the drop-down menu appeared, you clicked on one of the elements in here. So far, so good. This is what you recorded. Now when you're replaying, and in particular, when you are replaying something from a web page, it may be that there are delays, and the drop-down menu will not appear on time. Then you'd be clicking somewhere completely different when replaying the recorded events, which stipulate that after half a second you'd click in this very position. Finally, of course, the program itself may change. If the search button is moved from down here to up here, then again replaying the event log based on absolute coordinates will click in the wrong position.

07 l Recording GUI

In all of this, changes in size, changes in language, changes in speed, or changes in the program itself are all obstacles for record, replay, reproduction of graphical user elements. A better option is to have the GUI elements themselves do the recording. That is, when I'm clicking in here the search button knows it has been pressed, and what I am recording right now is not the fact that I clicked in this particular area of the screen, but what is recorded right now is that the search button has been pressed. Likewise, if we enter text in this field, what will be recorded is that in the URL GUI element, the text foo was entered. This way, if anything in the resolution changes or if the position changes, the events will still be tied to the GUI elements, and therefore, our recording becomes independent of the actual rendering on the screen. This of course is also true for rearrangements as long as the identifiers of the GUI elements stay the same, we'll still be able to replay earlier logs even though many details, in terms of the rendering, have changed. The best capture replay tools offer facilities to record and replay at the low levels. That is, mouse clicks and keystrokes, but also facilities to replay at higher levels. That is, directing some input specifically toward some element of the graphical user interface. For web pages, the Selenium Toolkit offers a nice mix of multiple abstraction layers. Check out what kind of tools are available for your specific environment and test them thoroughly before making a purchase.

08 q Risks

So now for a quiz: What are the risks of GUI capture replay tools? They may capture sensitive information, which should not go to developers. Recorded scripts may not be replayable in different environments. Different because of different local, different resolution, different font size, or different speed. Recorded scripts may be prone to change as the software or the GUI changes. Check all that apply. Over to you.

08 s Risks

Now for the answer: Well, GUI capture replay tools can capture sensitive information, just as sensitive information may be contained in files. If you enter a password, for instance, on a webpage, and this is recorded, this can even become a security issue. Next recorded scripts may not be replayable in different environments. Yes, we have seen that. Finally, recorded scripts may be prone to change as the software of the GUI changes, obviously. We also have seen that. All three apply.

09 p Record Calls

At this point we have seen how to reproduce static data simply by shipping the files over as the data is required by the program. We have seen how to record and replay user interaction. However, all of these are actually instances of the same class of input that is interaction between the program and its environment. The question is whether it would be possible to record and replay this interaction as a whole. A program interactions with its environment by means of function calls. We use function calls to access data. We use function calls to access user interaction or generally anything that comes form the environment. So, what we could do is to record and replay function calls. This way we would have a single mechanism to capture all sorts of interaction between the program and its environment. Our plan would be to record every function call with parameters in a log then later be able to replay them. Let's first do the recording. We've already seen our tracing functions where we can access the individual execution frames of a run. We know that each frame contains the function name as well as all local variables. When a method gets called this list of local variables is actually the list of arguments for the function. If, for instance, I have set up the traceit function, which gets called for every line as always. It also gets called when a function is called. In this case the event variable up here has a value of call. If this happens then we print out the name of the current function, and we print out the local variables, which should be the arguments. Let's try out whether this works. What we see here is we have a call to remove html markup function and s as precisely the value as here in our code. We would now like to turn the function name as well as the argument list into something that can be translated back into code. We want it to look precisely like this. If we turn this into a string that looks exactly like a function call, then we can pass this to the Python eval function, which takes a string and interprets it as code. This is something we will use later for replaying. Here you can see how the string is interpreted, and the result is printed. Here's the function that takes care of the printing of the local variables that is the arguments. Since we don't know the position of the individual parameters in the function, what we use is we print out named arguments instead. That is, we print out name equals value. This will be separated by commas. Now again we will remove html markup with this parameter. If everything goes well, the traceit function to print out this very call with the right argument. This is precisely what happens. You see, we print out remove html markup with the parameter s being the string foo exactly as over here except that we do have a named parameter and not a positional parameter. The same mechanism works for tracing or recalls as long as these are Python functions. So if we call square root of 2, for instance This would also get traced and reported in our output. As you see here, here is the count of square root and x has a value of 2. Now let's assume that we have a variable, which actually has stored all these calls. We could do so by recording it right away while the code is executing, or we could also say store this in a file and read this back from a file. Your job now is to write a piece of code that evaluates all these calls. First this one, and then the other one, and which prints out the appropriate results for each of these calls in the form function argument equals return value. Over to you.

09 s Record Calls

So here's the answer to our programming exercise: We simply set up a loop over our calls, evaluate the calls as recorded, store the result in the return variable, and then print out the call, an equal sign, and the result. Let's see what this does. And you see, we actually invoke remove HTML markup with the recorded value and get a new result, and we invoke square root again with the recorded value and get a square root of 2.

10 l Paths of Improvement

With this we have provided a basic mechanism to record and replay arbitrary function calls together with their values. This mechanism, in general, allows to record arbitrary interaction between program parts, as well as arbitrary interaction between the program and its environment. But this is just a tiny slice of what can be done in record and replay. Actually, while being pretty general, it's also pretty limited yet. In particular, one could store the list of calls in a file such that it can be reproduced at any time, handle structured elements properly, that is, lists and maps, record, and possibly replay, return values too. Record global variables, or record only a subset of all calls. Python makes recording and replaying function calls very easy. Especially compared to compiled languages such as Java or C. So feel free to toy around and expand this code.

11 l Other Facets

At this point we've seen how to handle static data, how to capture and replay user interaction and even a generic mechanism to record and replay the interaction with the environment. Let me now briefly discuss some additional items that are of particular importance during debugging. The key concept for all of these is to make non-deterministic behavior deterministic and controllable at the same time. First, time. If your program depends on real dates and times, be sure to provide a means to set these, for diagnostic purposes. If it depends on real timing, make these limits controllable as well. Second, randomness. If your program has planned, non-deterministic behavior, for instance, if you're building a game, ensure that you can reproduce this at will. If you use a pseudorandom generator, be sure to save its seed. If you use a true random generator, save the sequence of random numbers. Third, schedules. Multithread programs may non-deterministically switch between threads. Again, find means to make these thread switches deterministic. Otherwise, you may run into hard-to-reproduce thread scheduling issues. Fourth, physical influence. Depending on your environment, it is possible that some physical influence causes software errors. If you place a strong alpha emitter on top of your CPU or your RAM, this will likely cause malfunctions. Cosmic rays may also result in spurious errors. But this is more an electronics problem rather than a software problem, so we leave it as it is. Fifth, debugging tools. Debuggers instument the code and alter its execution. The least they do is to influence the real timing. Therefore, they may induce Heisenbugs, again, bugs that appear or disappear when being observed.

12 q Buggy Laptop

Let me close this section with my favorite story on bug reproduction. In 2002, I went to a conference to give a laptop demonstration of a complex software system. Before I left, we had taken every precaution so that the software would run perfectly. But then came the evening before the demonstration. I would be sitting in my hotel room and try out, for the last time, so everything would be working fine. Nothing worked. The software would run into timeout errors anytime, everywhere. I phoned my department at home. They had the same software, on the same machine, and everything worked. I debugged this for 3 hours in a row. My battery had even run out of power, so I had to recharge it. That's when I decided to restart from scratch. To my amazement, the demo now ran just fine. And so I wondered, why did the program work after restarting from scratch? Was this because of some caching problem? Did this have something to do with the battery? Was this because I reinstalled everything? Or because of cosmic radiation? Hint: I gave you an indication right within the story. Over to you.

12 s Buggy Laptop

And now for the answer. Indeed, I gave a hint right within the story. "Battery" is the correct answer. After 3 hours, my battery had run out. Which means that until then, I had worked on battery power. In order to recharge my battery, I reconnected to AC power. And it turned out that it was this difference that caused the program to fail. When my laptop ran on AC power, which was precisely the configuration with which everything was installed, the program worked just fine. If it ran on a battery, however, the program failed. It turned out that my laptop ran slower when running on batteries. This saved energy, but also introduced interesting timeout errors. So I made a mental note to fix this as soon as we got home, and of course, I gave the demo on AC power to make sure that all our tests were not in vain.

13 l Capturing Coverage

If a program fails in the field, rather than in the lab, this makes reproduction hard at first because you lack all the data you might need to do efficient debugging. However, having programs execute in the field can actually also ease--sorry. However, having programs that execute in the field can actually also ease debugging. The basic idea is to exploit the many runs that are out there in the field. Suppose you have a number of runs, hopefully few, that fail and a number of runs, hopefully many, that pass. Or that don't fail. What you could now look for is features that correlate with failure. Sorry, we'll do this again. What you could look for is features of the execution that statistically correlate with failure. One such feature could be the execution of individual parts of the program. If you knew, for instance, that in all the nonfailing runs a specific function is never called, whereas the same function is always called in all the failing runs, and this information goes to you, of course the first thing you would do is to look at the function, F, in order to see what is it that could possibly go wrong in here. Instead of F, you could also come up with other execution features. For instance, this could be a specific line, or a specific return value, or a specific variable value. All of these features could possibly correlate with success or failure and thus provide important hints when it comes to the debugging. To get this to work, we need to collect these execution features. And we'll start with executed lines because this is plain and simple coverage. The idea, again, is to use a tracing function, and the tracing function would record for every single line in the file, whether it has been reached. So we have this global coverage variable and if we reach a new line, we extract the file name and the line number. And what we now do, if we do have the file name and the line number, is we check whether we've seen the file name before, and if we haven't, we create a new empty set. Then we add the line number to the set we found for this file name. Here again we set the tracer, and now we simply invoke this on a simple text without HTML markup. Let's see which lines are actually covered. Click on run, and here's the set. Lines 4, 5, 6, 8, 9, and so on, are being covered in here. Let's go and mark these lines, which are covered here in the code. There's line 4, 5, 6, 8, 9, 11, 13, 15, 16, and 18. The lines that are not covered are line 10, line 12, and line 14. That is all the lines that would set tag to specific values, or set quote to specific values. Since the result of passing some text without HTML markup is actually correct, we could now deduce that maybe we should come up with tests that check for these lines which we haven't covered yet, and see whether these fail, and then we could find out that errors would actually be related to the execution of these lines. But first, let's come up with an automatic mechanism to produce this markup; that is, whether a line is executed or not. The idea is to produce the program as output and prefix every single line with a star if it has been executed and with a blank if not. So we need to turn this set into the code prefixed with stars or not. Here is a starting point. Since we already have the file name, we can actually open the file name and read its context as an array of lines. Let's see whether this works. Indeed, what we get is a list of the individual lines in the program.

14 q Star Lines

So here is a bit of a programming exercise: Output each line prefixed with a star and a blank if it's covered and prefixed with 2 spaces if not. Over to you.

14 s Star Lines

Here's the answer to our programming exercise. We set up an index, I, which we let run over all lines. In the coverage array, now I is running from 0 to the number of lines minus 1. However, in coverage, lines are started with 1 and not with 0. So we check whether I + 1 is in the set of lines covered for that specific file name. If the line is in the set, then we print out a star followed by the line. Since the line already includes new line characters, we omit this by adding another comma at the end. If the line is not covered, then we output a blank in front of the line. Let's see whether we get a nice listing together with the covered lines. Unfortunately, our web page IDE strips away all leading blanks. But what we see still is the executed lines prefixed by stars and the nonexecuted lines which have no such prefix. Again we can see that we do have a number of lines that are not executed at all. That is, for this particular input, foo, without HTML markup. A coverage tool such as this one can be very helpful in debugging, as it is in testing. In testing, you'd like to come up with tests that cover as many execution features as possible. So if you see, for instance, that what we have executed so far does not cover these lines, you may wish to come up with additional test cases. Better coverage tools would also count the number of times a line has been executed or save coverage that has been achieved so far in a file and thus allow for accumulating coverage over multiple runs with different inputs. For now, however, we're going to use the computed coverage for debugging purposes. Namely, to find lines whose execution correlate with failure.

15 l Different Input

Based on our coverage tracker, we can now compare the coverage for different inputs. I'm going to use the input foo without markup, as we've just seen it, I'm going to use foo with HTML markup, and I'm going to use the same enclosed in double quotes. Let's first start with foo as we've just seen it. Here the tag line is being executed, initialization of code, initialization of out is being executed. We have the loop, overall characters, we have the 1st condition, we have the 2nd condition, and we have the 3rd condition. As we've seen before, none of these conditions evaluates to true for any of the characters in foo. Instead, only adding characters to out is being executed. The assertion, finally, passes. Now let's see what this looks like for foo with markup. Again, the initialization is carried out for this input as well, as is the loop. However, here this condition immediately becomes true already on the 1st character, so tag is being set to true too. At the end of the tag, tag is being set to false again. We also check this condition, but since we don't have any quotes in there, this line is never executed, and, of course, we add individual characters. The assertion in the end passes. Now let's come to our input which should trigger a failure. That is, the same thing as before but now enclosed in double quotes. Again, all of this is executed, no way to avoid this. The 1st thing that happens in here is that quote is being set over here with the 1st character. And since quote is being set, these 2 conditions never hold. And everything, including the tag characters up here, is being added to the output. Since the markup characters are being added, the assertion finally fails. If you look at the coverage up here, you find that there is one line in the program whose execution correlates with failure. And this is precisely this line, namely, quote = not quote. This line is only executed in the failing run. So all we need in this situation is to look at the governing condition, and lo and behold, we do have a defect. Ha! That was easy, wasn't it? Unfortunately, things are not always that easy.

16 q Tricky Input

So, here's a quiz. Go and provide an input that makes the assertion pass but still execute this line. Over to you.

16 s Tricky Input

So now for the answer. What we need is an input in which quotes are contained within the tag. For instance, this could be an input like less than, double quote, double quote, and greater than. If we go through the program, we would see that tag would be set, then quote would be set, then quote would be unset, and then tag would be unset. The output would be empty, and therefore would not contain any HTML markup. Note, though, that this line would be executed and therefore also appear in the coverage. And then we can simply say that execution of this line directly is related to passing or failing. So when given an arbitrary program, it is unlikely that we will find lines that are only executed when the program fails. And anyway, such lines would be found by the 1st test that executes them. And we are pretty good at testing, aren't we? What we want to look for are lines that statistically correlate with failure. That is, they may occasionally pass, but by and large, they more frequently fail than pass.

17 l Phi Coefficient

How do we compute such statistical correlations? The Pearson correlation coefficient is perhaps the best known indicator of correlation, but this is suitable only for linear dependents. That is if you have 2 ranges of values. What we want instead is a correlation between 2 binary values. On 1 side we have the outcome of the run, either fail or pass. And then for each line we know that it's either covered or uncovered. And we want to come up with a correlation between coverage on one side, and outcome of the run on the other side. So the Pearson correlation coefficient is not appropriate for this. But there is another measure for such correlation, and it was also invented by Pearson. It's called the Phi coefficient. The Phi coefficient starts with a table. In this table you count how frequently a line was covered in failing runs as well as in passing runs. And of course you also count how frequently it was not covered in failing runs as well as in passing runs. These 4 values take the names N11, N10, N01, and N00, respectively. Where the first digit stands for covered versus uncovered, and the second digit stands for fail versus pass. You also compute sums over all rows as well as sums over all columns as well as the total sum of events. This then is the Phi coefficient. It consists of multiplying the values in the first diagonal and subtracting the product of the second diagonal. This is then normalized according to the square root of the sums of the columns and rows. The Phi coefficient is high if 1 diagonal has a high value and the other diagonal has a low value. If it is this diagonal, which has a high value that's down here, then it becomes positive. If it's this diagonal, which has a higher value, then it becomes negative. So the higher the value in 1 diagonal and the lower the value in the other diagonal, the stronger the correlation. Our plan now is we compute the Phi correlation for each line, and then we rank the lines from high correlation to low correlation. And we start, of course, with the strongest correlation. Let me illustrate how the Phi coefficient is computed on a simple example. Let's assume that we have a line in our program and a number of runs. In the first run, line 10 is executed, and the run fails. So what we enter in the table for this line is that we now have seen 1 instance where it covered and failed. Next thing that happens with this line is that it's not covered and the run passes. So we add a 1 in this part of the table. Next thing we see, it's covered and it fails. So the value on that table becomes 2. And then again, we see uncovered and it passes. So now we have seen 2 instances of the line being uncovered and passing. Let's assume these are just the 4 runs that we observe. Then we have seen no instance of the line being covered and passing and no instance of the line being uncovered and failing. Let's quickly sum up the remaining rows and columns. We have seen 2 instances where the line was covered and 2 where it was uncovered. We have seen 2 instances where it failed and 2 instances where it passed. The total number of events is 4.

18 q Calculate Phi

Now here comes the quiz. For line 10, what is the Phi coefficient or the correlation between coverage and failure using this formula. Over to you.

18 s Calculate Phi

Well, that's actually something we can easily compute. The product of these two is 4 minus the product of these two. This is zero so we have 4 minus 0, which is 4. The square root down here is the product of all the individual row and column sums. So this is 2 x 2 x 2 x 2, which is 16. The square root of 16 is also 4. So we have value of 4 up here, value of zero down here, value of 4 down here. Four minus zero divided by 4 is precisely 1, which means that we do have a strong correlation between failure and coverage. Here we have the product of the first diagonal minus the product of the second diagonal, and here we have the square root over the product of the individual sums of the rows and columns. Let's fill the appropriate values of our table and compute the Phi coefficient. We click on run, and we get 1.0. Always nice to see that the computer confirms what we already had thought.