
CS259 Lesson 3

01 l Simplifying Failures

This lesson is about simplifying. Why is simplifying important? Think about a plane catching fire as it leaves the runway. When you think about how that could possibly have occurred, you want to figure out what the cause of the failure is. In order to explain how the failure came to be, you want to come up with the minimum number of factors that explain it. So, here is our plane with passengers, luggage on board, entertainment system, air conditioning, and whatnot. Now suppose you had a means to turn back time and repeat history at will. Or suppose all of this were running in a simulation. You could try again and again to see whether the failure occurs or not. Then you could try to simplify things. You could, for instance, remove the passengers and see whether the failure still occurs. Oh, it still occurs. Remove the flight attendant and see whether the failure still occurs. Well, the plane still catches fire. Remove the seats. The plane still catches fire. Remove the air conditioning system and repeat the whole thing. Remove the entertainment system, remove the luggage, and you see that none of this actually is important. Whatever you remove, the plane still catches fire. You can go and remove the airport, and even with the airport removed, everything happens as before. But as soon as you remove this little metal strap that's lying on the runway, then all of a sudden your plane takes off just as normal. But if that strap is there, then your plane is on fire. What we're going to explore today is how to simplify failures--that is, finding what is relevant for the failure and what is not--and how to do this automatically--that is, automatically finding out the cause of the failure.

02 l Bug Report Nr 24735

This being a course on software debugging, let's take a look at a software problem. Mozilla is the corporation that produces the well-known Firefox browser. In the early days, the Mozilla browser was still in beta, and they got lots and lots of problem reports from their early customers. Here is bug report #24735. You can already see from the number that they got very, very many bug reports. What would happen here is that a user would start the Mozilla browser and go to the website bugzilla.mozilla.org, which actually holds all the bug reports in Mozilla, including this one. The browser would render the Bugzilla webpage. There would be a link "search for bug", and users would click on it. Then a number of menus would appear--say, for selecting the operating system, priority, or severity--where you would be able to enter or select search terms before pressing the big search button. If the user would now, with this Bugzilla page open, select "Print" from the menu, then the entire browser would crash with a segmentation fault. Such bug reports were not exactly uncommon in the early days of Mozilla. There would always be some HTML input that would come from a web server and then be rendered as a webpage which the browser couldn't properly handle. Let's take a look at the HTML code for bugzilla.mozilla.org.

03 q Mozilla Bugathon

This is what the HTML code of bugzilla.mozilla.org looks like. It actually is almost 900 lines of HTML, so you feed this into the Mozilla browser, select print, and it crashes. But why? In the early days of Mozilla, problem reports such as this one and failures such as this one came in quicker than Mozilla programmers could possibly simplify them or even look at them. But then the Mozilla product managers had a clever idea. They started the Mozilla Bugathon, which meant that volunteers would go and simplify test cases. This was a very simple process. What you needed was a text editor such as this one, and then you would go and remove parts of the HTML page--say this, for instance--run this through Mozilla, and see whether the failure still persisted. Again, remove some part of the input and see whether the failure still persisted--again and again and again. You repeat this until you have a small page left in which everything is relevant for reproducing the failure. If you do this and keep on removing one part after the other, then all of a sudden you end up in a situation where the failure no longer occurs. After you remove some part and the failure no longer occurs, what do you do next? Do you put the part you just removed back in and try to remove other parts? Or do you keep only the part that you just removed and remove everything else?

03 s Mozilla Bugathon

Thank you. Here comes the answer. The answer, of course, is that you put the part back in and try to remove other parts. Why is that so? Well, if you remember our discussion of the scientific method in the first lesson, what you just did was set up a hypothesis on why the failure occurred. You removed some part, and you observed that after removing this part, your hypothesis on what causes the failure no longer holds, so you go back to your earlier hypothesis and remove other parts. Trying to keep only the part and remove everything else makes little sense, and keeping on removing until the failure occurs again is also very unlikely to work.

04 l Motivation Is Important

To make this process of simplifying even more rewarding, Mozilla would offer rewards. For five problems simplified, a volunteer would be invited to the Mozilla launch party. For ten test cases simplified, he or she would also get an attractive stuffed animal--the Mozilla mascot. For 20 test cases, one would get a t-shirt signed by the grateful engineers. Let's see whether I can earn a stuffed animal. Let's remove this part. We invoke Mozilla. The failure still is there. We remove another part, invoke Mozilla, and the failure is still there. We remove another part, see how it gets simpler and simpler, and the failure still is there. Mozilla still crashes. Now I have reduced this entire Bugzilla page to a single line. If I now invoke Mozilla and invoke print, Mozilla crashes.

05 q Simplifying Further

How can we simplify this further? Should we try to remove name equals operating system? Remove the multiple attribute? Remove size equals 7? Remove select? Or remove the angle brackets? Pick the three which have the highest chance of still reproducing the failure. Over to you.

05 s Simplifying Further

Hello again. Well, that was a bit of guesswork, but I think you might have guessed the right answer. We can find it by elimination. If we remove the angle brackets, then all of this HTML tag is going to turn into regular text. This will create a very different behavior in Mozilla, and therefore we are probably not going to reach the code which originally caused the failure. The same goes for select. If we remove the attributes--that is, name equals operating system, or multiple, or size equals 7--then our input will still be a select tag, just with no attributes at all, and this will still trigger the original failure. So, you see that in this big HTML input, only eight characters--<select>--are relevant for reproducing the problem. This will certainly earn us a nice reward from the Mozilla developers.

06 l Simple Is Better

So, we have seen how to reduce an 800-line input to a single line, and even reduce this single line further to 8 characters that reproduce the bug. This is also what the Mozilla volunteers did, and within one night the first volunteers had already earned their t-shirts. This is obviously simpler than 800 lines of HTML code, but what is it that makes something simpler? Or when do we say that something is simpler than something else? What is it that makes something simple? First, there is the burden it takes to understand something. In debugging, an input that takes me 10 seconds to understand is simpler than an input that takes me 10 minutes. This is why one line of HTML is better than 800 lines of HTML: the one line of HTML carries a far smaller burden in understanding why something went wrong. Second, the less burden it takes to explain something, the simpler it is--or maybe the other way around: if something is simpler, it takes less effort to explain it. If I say Mozilla cannot print a select tag, then this is way simpler than saying Mozilla can't print this Bugzilla page with the following 800 lines of code. Plus, a short explanation also points me directly to the code in question. If I know Mozilla can't print a select tag, then I can go directly to the code in which Mozilla handles printing select tags. All of this, however, refers to humans. It is the human burden we're talking about. Things that are simple are easy to explain and easy to understand, but what is complex to one can be simple to others and vice versa. The same thing can be explained in simple terms and also in complex terms.

07 q What Is Relevant

In debugging, simplicity eases our lives from the very beginning. As an example, consider bug reports. If you ever maintain a big piece of software with dozens and dozens and maybe even thousands and thousands of users, you may have gotten feedback from users--that is, bug reports. In the early 2000s, I coauthored a debugger named GNU DDD. It was pretty successful, so it had many users, and I also got lots and lots of bug reports from the field. These bug reports--even in these early times--of course came by electronic mail. But I'm drawing actual snail mail, because it's cuter. Some of these bug reports would contain just the required information, such as the steps needed to reproduce the problem as well as the observed behavior, but then there would also be people who would send in the entire programs that they were currently debugging. Some would send in their entire home directory or even the contents of their entire hard disk, just in case this would be needed to reproduce the problem. And others would send in just the one-liner, "Your program crashed," which was less information than I would need. What you'd like to have in a bug report is information that is relevant. Here, relevant means: if it's different, then it changes the behavior. Of course, when users are submitting bug reports, they don't necessarily know whether their information changes the behavior, so to stay on the safe side they supply more information than would actually be required. But in order to have a simple bug report or a simple test case, we want the information in there to be relevant. This calls for a quiz. When we simplify a test case, just as we did before for the HTML input, then everything is what? Simple, correct, relevant, or elegant? Over to you.

07 s What Is Relevant

And now for the answer. The correct answer, after all we've discussed, of course is relevant, meaning that every single piece of information that remains in the input changes the outcome if it is changed itself. If we say that Mozilla crashes with these 8 characters, removing any of these characters changes the outcome. Therefore, all of these 8 characters are relevant for the failure of Mozilla. Of course, this input also is simple, but this is not necessarily always the case. You can have very, very complex failures where lots and lots of information is relevant but still hard to understand, and therefore not necessarily simple. Everything is correct? I don't know what a correct input is; this cannot be generalized. Elegant? Well, I'm not sure what an elegant test case is. There is elegant code and there are elegant fixes, but this does not necessarily apply to inputs. So, the correct answer is relevant.

08 q The Tao of Simplifying

To put it with Steve McConnell, the goal of simplifying the test case is to make it so simple that changing any aspect of it changes the behavior of the error. Or, in our words, to make it so simple that every aspect of it is relevant. There is also a nice quote which is usually attributed to Albert Einstein: "Everything should be made as simple as possible, but no simpler." This of course calls for a simple process for simplifying. Namely: for every circumstance of the problem, check whether it is relevant for the problem to occur. If it is not relevant, remove it from the problem report or the test case in question. To see whether you got all this, let's have a quiz--namely, a word problem first stated by the French novelist Gustave Flaubert in 1843. A ship sails the ocean. It left Boston with a cargo of wool. It grosses 200 tons. It is bound for Le Havre in France. The main mast is broken. The cabin boy is on deck. There are 12 passengers on board. The wind is blowing east-northeast. The clock points to a quarter past 3 in the afternoon. It is the month of May. Now for the question: how old is the captain? Over to you.

08 s The Tao of Simplifying

Okay, that of course was not a real quiz. In fact, Flaubert's word problem is most famous for providing all these details, which all turn out to be completely irrelevant. Actually, none of this information helps in answering the central question. In debugging, we need to do the same. We need to find out which information is relevant and eliminate all the irrelevant information.

09 q Importance of Simplifying

Now, for a real quiz. Why is simplifying test cases important in debugging? Is this so because a simplified test case is easier to communicate? Or because a simplified test case usually means smaller fixes? Or because it usually means smaller program states? Or because it usually means fewer program steps? Check all that apply. Over to you.

09 s Importance of Simplifying

Now for the answer. Let's check each in turn. A simplified test case is easier to communicate. This is correct, because if it's simpler, it takes less time and is easier to understand. A simplified test case usually means smaller fixes. Unfortunately, that's not the case. It may mean faster fixes, but it does not mean smaller fixes. The fix will be the same whether the test case is in its original complex form or whether it's simplified. A simplified test case usually means smaller program states. Yes. If the input is smaller, then all the data structures that represent the input will be smaller as well. And a simplified test case usually means fewer program steps. This is correct as well, because it takes fewer steps to process the simplified input. And there we go.

10 l Automated Simplification

Now let's get back to our simplification scheme. Obviously, simplification works, but if you as a human do the simplification, you may not be entirely happy with it. That is because manual simplification is tedious and boring. It is tedious, because you have to run these tests manually again and again and again. Second, it's boring, because it's a rather mechanical activity without great intellectual challenge. I mean, we just keep on stripping away some input and invoking the program again and again and again. Who really wants to do this? Fortunately, there's an answer to that. We do have something that helps us specifically with tedious and boring activities. This is called a computer, because we invented automatic machines precisely to relieve us from tedious and boring tasks. We're going to hand the process of simplification over to a machine, resulting in automatic simplification. Can we do that? For automatic simplification, we need two ingredients. We need a strategy that does the simplification for us--basically, something that tells us how to simplify. And we need an automatic test that checks whether our simplification succeeded or not. Let's first start with the automatic test. For the Mozilla example, we would need a test that takes HTML input, starts the Mozilla web browser, feeds the HTML input into the browser, and then finally checks whether Mozilla fails or whether it succeeds. All of this needs to go into our automatic test. There are a number of ways to set up such tests. For instance, we could start Mozilla and then replay appropriate user events and see whether it crashes or not. We could also write JavaScript programs and have Mozilla execute them automatically. That is, we'd write a piece of JavaScript that would read the HTML input and then check whether this causes a crash or not. For this course, however, we'll set up a simulation that does the very same thing, but actually without involving Mozilla. We will simply write a little Python function that does all the testing for us, all of it automatically on the machine.

11 l Fake Mozilla

What we want is a function called "test" which takes a string s--this would be the HTML input--and if the HTML input s would cause Mozilla to crash, test should return fail. Otherwise, it should return pass. We're simply going to search whether s contains a select tag. For this we set up a simple regular expression, which says: there is some substring in s which consists of an opening angle bracket, select, and then an arbitrary number of characters that are not a closing bracket--that is, not a greater-than sign. This is expressed by the caret over here, meaning "not the greater-than sign", followed by a star, meaning 0 or more instances of that, and finally a closing bracket. So we have a less-than sign, select, then anything except for a greater-than sign, and finally a greater-than sign. This is what makes an HTML tag. If this occurs within our string, we return fail. Otherwise, we return pass. This regular expression requires a function from the re module, so we first need to import it. Let's see what happens if we feed just a regular HTML tag into the testing function. What we get is pass, because there is no select tag in here. Let's expand this a bit, and now let's actually add a select tag, as in this example. Now we see that our output changes to fail, because in here we actually do have a select tag. This is what our testing function does. It simply simulates the behavior of Mozilla. Normally, we would pass the input into Mozilla and then see whether it crashes or not, but in order to keep things simple, we have this simple testing function, which checks directly for what is supposed to be in the input.
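
To make this concrete, here is a minimal sketch of this simulated testing function, following the description above. The lowercase tag name and the exact FAIL/PASS return strings are assumptions; the lecture's actual code may differ in such details.

    import re

    def test(s):
        # Simulated Mozilla: the test fails whenever the input contains
        # a select tag: "<select", then any number of characters that
        # are not ">", then a closing ">".
        if re.search("<select[^>]*>", s):
            return "FAIL"
        return "PASS"

    print(test("<b>foo</b>"))           # PASS - no select tag in here
    print(test("<select size=7>foo"))   # FAIL - contains a select tag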

12 p Implementing the Strategy

So far for the automated test; now comes the simplification strategy. What could such a strategy possibly look like? In their book The Practice of Programming, Kernighan and Pike describe a simple binary search simplification strategy: throw away half the input and see if the output is still wrong. If not, go back to the previous state and discard the other half of the input. This is something that Kernighan and Pike meant to be used in a manual setting, which actually requires the editor in which you are editing the input to have a good undo facility. However, we can also translate this binary search approach directly into a program which does the simplification for us. We assume there is a function simplify() which takes an input and, based on the test, automatically simplifies it. We assume that the input we're passing to the simplify function already fails the test. Since we want to test the two halves of the input separately, we need to split it into two parts. We figure out what the length of the input is, divide this by two, and split the input into strings s1 and s2, where s1 goes up to this index and s2 starts with this index, such that s1 and s2 concatenated still result in the original string. Now we test the first substring. If it fails, then we proceed by simplifying this first substring. What should we do if the first test does not fail? Well, I guess we then must go and work on the second substring. How do you do that? Over to you.

12 s Implementing the Strategy

So, here is the answer. The second substring is handled exactly like the first substring. If its test fails, then we go on to simplify this substring instead. If neither substring fails, then we return the original string.
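
Putting question and answer together, the recursive simplify() function might look like the following sketch. The explicit base case for inputs of length 1 is an addition, to keep the recursion from splitting forever.

    def simplify(s):
        # Precondition: the input as a whole must fail the test.
        assert test(s) == "FAIL"
        if len(s) <= 1:
            return s                  # nothing left to split
        split = len(s) // 2           # binary search: split into halves
        s1 = s[:split]
        s2 = s[split:]
        if test(s1) == "FAIL":
            return simplify(s1)       # first half reproduces the failure
        if test(s2) == "FAIL":
            return simplify(s2)       # second half reproduces the failure
        return s                      # neither half fails: keep the input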

13 q Simplifier in Action

So, what happens if we take this big HTML input and run it through our simplification procedure? Let's find out. We click on "Run." You see that the simplification algorithm correctly returns <select>, which means that it has been able to very nicely simplify this complex input into just these eight characters. Let's take a look into how this actually operates. I'm inserting a print statement in here, which gives us a representation of the string as well as its length, so we can see whatever simplify() has been working on. Down here we can see how simplify() operates. It first takes the entire string. Then it takes the first half and tests it. Fails again. Of this it takes the first half--fails again. Takes the first half--fails again, and there we go. With a technique like this, we can automatically simplify any input. All we need is an automated test that tells us whether the program in question passes or fails. Unfortunately, at this point our strategy is not yet perfect. We'll have to refine it a bit, because it doesn't work well for all inputs. What happens if we put in another HTML input--that is, <select>foo</select>? What do we get as a simplified input with this function? So, here's our quiz. What's the result of simplify() with this input, <select>foo</select>? Is it a string consisting of the eight characters <select>? Is it a string consisting of <select> and the first two characters of foo? Is it a string consisting of <select> and all of foo? Or is it the entire HTML input? Over to you.

13 s Simplifier in Action

Now for the answer. We can find this out ourselves simply by running this in the IDE. Here is our input. I click on Run. You can see that the binary search simplification does not simplify as much as one could imagine. In the beginning we had the 20-character string <select>foo</select>, which is then reduced to the 10-character string <select>fo just by splitting it in half. Now if we split this again into two 5-character strings, neither of the two halves will actually contain a select tag. Both will pass. Therefore, we stick with the string we had, which in this case is <select> followed by "fo". So, for the quiz, <select>fo is the correct answer.

14 l Delta Debugging

Let's illustrate once more what happens if we take this HTML string and put it through our binary search simplification process. Here is our HTML string, and we first check whether the first half fails. Yes, it fails. But now, when we split this string into two substrings, the first one passes, and the second one passes as well. In this case the binary search algorithm is stuck: the first half doesn't result in a simplification, and the second half doesn't result in a simplification either. So we stop here and simply return the entire string, which is what binary search does. As humans, of course, we have a good idea of what to do next. We could, for instance, go and cut away smaller parts. In here, we would already see the structure--<select> and fo. We could, for instance, go and remove the "fo" because it's distinct from the HTML tag. A computer may not recognize these structures, but what a computer can still do is try to remove smaller chunks. That is, rather than trying to remove entire halves, we would try to remove, say, quarters or eighths of the input, until we're down to removing single characters. What we do in this situation is increase the granularity. That is, we take away smaller parts more frequently, until we end up removing single characters, such that we really, really get an input where every single character or every single part is relevant for reproducing the failure. This precise strategy is called delta debugging, and here's how it works.

15 q Next Step

First, we split the input into n subsets, where initially n has a value of 2. Second, if the test still fails after removing any one of these subsets, we proceed with this reduced input. This is what we have for our input: we split the input into initially two subsets, and if removing one of them still fails, as in here, we proceed with the remaining input. Otherwise, we go and increase the granularity--that is, we double the number of subsets--and go back to step one. Let's try this out on our input over here. At this point we have already tried removing the second half of the input and removing the first half of the input. Neither removal would fail, so now we have to increase the granularity by coming up with four subsets. This is a 10-character input, so each of our subsets will have between two and three characters. For a change, let's try removing the first two characters. Well, no select tag, no failure. We put them back and remove the next three characters. This gets us an "<sct>" tag. No select tag, no failure. Again, we remove the next three characters--ct and the greater-than sign. No select tag, no failure. Now we remove the last two characters. At this point, we have the select tag by itself, and again we have a failure. What should we do in this case? Should we now go and start again with two subsets, that is, setting n to 2? Or should we set n to the number of subsets we had before minus 1, because there is one less? Experiments show that it's best to stick with the granularity you have achieved, because then you won't have to backtrack through all the levels of granularity you have already seen. So, we set n to n - 1, which means that in our case we would have three subsets. However, we don't want n to become less than 2, because otherwise we'd just have one subset. So, we set n to the maximum of n - 1 and 2, such that n can never fall below 2. At this point, our input is just the select tag <select>. It fails, and the number of subsets is 3. What is it that delta debugging would do next according to this description of the algorithm? What is the next test that delta debugging has to conduct? Is it the select tag as a whole? Is it the select tag with the second half removed? Is it the select tag with the first third removed? Or is it just "ct>", that is, the last third of the input? Over to you.

15 s Next Step

Now for the answer. At this point, n is equal to three, so we need to split the input--that is, the select tag--into three subsets. We now have eight characters in here, so each subset has either two or three characters. The first option is not the right answer, because we now need to remove these subsets, and this is the entire input; nothing is removed from it. In the second option, it is not a third that is removed but an entire half, namely the second half. Therefore, this is wrong as well. In the third option, however, we have the first third removed--that is, the less-than sign and the s have been removed. This could well be the next test for delta debugging. Another alternative could have the less-than sign, the s, and the e removed, but this option is not in here. The last option is just testing the last third of the input, which means that not only one third but two thirds are removed, and this is also not in the description of the algorithm. So the select tag with the first third removed is the next test that delta debugging conducts.

16 q Number of Tests

Removing any of these subsets results in a passing test, which means that we again have to increase the granularity by doubling n. Now we split the input into six subsets, resulting in six more attempts to remove individual parts--all of them pass. Then we have to increase the granularity again, which would give us 12. But our input only has 8 characters, and we can't split an 8-character input into 12 subsets unless some of these subsets were empty. We fix this by setting up a minimum function over here: we make sure that n cannot grow beyond the size of the input, which means that in our case n would now be 8. This is the final stage of delta debugging, where it tries to remove one character after another, but all of these removals result in passing test cases. What we get in the end is this very string, <select>, and we have now shown that every single character in here is relevant, because removing it changes the outcome of the test from fail to pass. What delta debugging gives us in the end is a simplified input where every single part is relevant. At this point you may wonder: how many tests does delta debugging actually take? You can see that at the very end, when every single character has to be removed individually, the number of tests is proportional to the length of the simplified input. In the worst case--that is, for very pathological examples--the number of tests delta debugging needs is quadratic in the size of the input. However, there is a nice situation in which delta debugging simply becomes a binary search and is as efficient as a binary search. So, when does delta debugging need only a logarithmic number of tests? Is this when all tests fail? Is this when there is always a half that can be removed and fails? Is this when all tests pass? Or is this never? Check all that apply.

16 s Number of Tests

And now for the answer. Let's check each one by one. When all tests fail, that's an interesting situation. It's rather unlikely, but let's assume it actually happens. Then we split the input initially into two subsets; removing the first subset fails, so we proceed with the reduced input, and n remains 2. We do this again, splitting the input into two subsets, and two subsets again, and two subsets again. This is exactly the binary search described by Kernighan and Pike, which is logarithmic. The next answer: when there always is a half that can be removed and fails, this is the same situation, except that we may have to try removing the first half first and get a pass. But in terms of complexity, it's still logarithmic in the size of the input, and therefore this is a nice situation in which delta debugging is very, very effective. This is actually one of the strengths of delta debugging: if you can split the input into two halves and one of them fails, then it behaves like a binary search. By increasing the granularity later on, it's no longer as efficient in the number of tests, but it is very, very thorough, trying to remove small parts one after the other. So it's like a binary search in the beginning--super efficient--and then it's as thorough as trying to remove every single part, one after another. When all tests pass, then delta debugging does not need a logarithmic number of tests; that's more like a linear number of tests. And never--that's not the case, because in the first two situations the number of tests is indeed logarithmic. This is the answer.

17 p Finish the Debugger

Now that we have seen how delta debugging works on paper, we should actually go and implement it as a Python program. Here we start implementing the delta debugging function. Formally, this is the delta debugging algorithm for minimizing, so we call it ddmin, for delta debugging minimizing. It takes the string to be simplified as a parameter. We could make ddmin even more generic by also passing the test function as a parameter; then we could use it for arbitrary tests in arbitrary contexts. For this exercise, however, we'll simply use the hardwired testing function, which you have already seen. We start with the assertion that testing the entire input should fail. We have a variable n which holds the granularity. As described before, this is initially 2. We now set up a while loop, which loops while the input still has at least two characters. If it has one character, then there's nothing left to simplify. We define the length of the individual subsets by simply dividing the length of the entire input by the current granularity. We have a loop guard in here: this variable will be set as soon as some test fails. The variable start is a cursor over the input which tells us where to start chopping away individual parts of the input. Here precisely we do the chopping. Start tells us where to start chopping away things. That is, everything up to start should be included in our test, and beginning from start plus the length of the subset, we also want the rest to be included in our testing input. What's missing in here, of course, is the substring starting at start with the length subset_length. This is what we chop away; we test the rest and check whether the result is fail. That is, if after chopping away a subset the test fails, what do we do then? We set our input to the complement. That is, we keep on working with the now simplified string. We decrease the granularity by 1 but don't let it fall below 2, and we set the variable some_complement_is_failing to true, because we need this to step out of this loop, which we do right away. If the test has not failed, then we have to proceed to the next subset, which we get by simply increasing our start cursor, which then points to the beginning of the next subset. This is something we do until we have reached the end of our string. Now, at this point we have gone through all the subsets, and if some complement has failed, then this variable is true. So far so good--we simply keep on working with the reduced, simplified input and a decreased granularity. However, if all of these tests pass, what should we do? Remember the third step in the delta debugging algorithm: we now need to increase the granularity, but must make sure it doesn't grow beyond the length of the current string. I will leave it to you as a programming exercise to complete this code. Over to you.

17 s Finish the Debugger

Okay, welcome back. Let's finish coding this off. In our original description of delta debugging, we have to increase the granularity at this point. That is, n becomes 2n, but must not grow beyond the length of the current input we're working on. Our working input here is s, so n must not become bigger than the length of s; otherwise, we simply double it. And if we have already reached the total length of s, then we return the simplified input. Here again we have the HTML input for which the binary search simplification didn't simplify enough. Let's see how delta debugging fares instead. We click on Run. You see the output now is simply <select>, showing that delta debugging has managed to really minimize everything away in here.
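
Assembled from the description above and this completion, the full ddmin function might look like this sketch; variable names such as some_complement_is_failing follow the wording of the lecture.

    def ddmin(s):
        assert test(s) == "FAIL"
        n = 2                                   # initial granularity
        while len(s) >= 2:
            subset_length = len(s) // n
            start = 0
            some_complement_is_failing = False
            while start < len(s):
                # Chop away the subset [start, start + subset_length)
                complement = s[:start] + s[start + subset_length:]
                if test(complement) == "FAIL":
                    s = complement              # keep the reduced input
                    n = max(n - 1, 2)           # one subset fewer, at least 2
                    some_complement_is_failing = True
                    break
                start += subset_length          # proceed to the next subset
            if not some_complement_is_failing:
                if n == len(s):
                    break                       # maximum granularity reached
                n = min(n * 2, len(s))          # increase the granularity
        return s

    print(ddmin("<select>foo</select>"))        # -> "<select>"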

18 q How Many Calls

Now for a quiz. How many calls of the testing function--that is, without the assertion; we need to comment this one out--does ddmin require in this run, that is, for this input? Please provide your answer over here. Now, over to you.

18 s How Many Calls

This can be determined easily. All we need is a counter in the testing function. I have extended the test function a bit, such that there is a test counter up here. With every invocation, the test counter is increased by 1. So that we see what's going on behind the scenes, I'm printing out the current value of test_count--that is, the number of the test that is being executed--the string that we're testing, and its current length. I'm also printing out the result--fail or pass. Now let's see what we get when we're executing this. You see this is the initial test; this is coming from the assertion. First test, <select>foo</select>, 20 characters long--failing. What you can see here is that delta debugging quickly progresses in its binary search phase from 20 to 10 to 5 characters. Well, 5 doesn't work, so it sticks with 8. By the 10th test it has already determined the minimum. However, it still tries to remove single characters to make sure that the final result is really relevant. Overall, it takes 22 tests, but since the first test is due to the assertion, we end up with 21. The correct answer therefore is 21.
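
The instrumented testing function might look like this sketch; the counter name test_count follows the lecture, the rest is an assumption.

    import re

    test_count = 0   # number of test invocations so far

    def test(s):
        global test_count
        test_count += 1
        result = "FAIL" if re.search("<select[^>]*>", s) else "PASS"
        # Log the test number, the string under test, its length, and the result
        print(test_count, repr(s), len(s), result)
        return result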

19 q Generic Tests

What we have seen now is how delta debugging can take a specific test and use it to automatically simplify a fairly complex input into an input where every single character is relevant. So far, however, we needed to set up a specific test function for every failure. Is there a way to come up with generic tests that can be used in arbitrary situations? Here again, we have a testing function, and we want to turn it into something that is more generic--for instance, something that simply checks whether some assertion has been raised. Here, I have converted this testing function into something way more generic. All that's left in here is the call to the function I want to test, in this case remove_html_markup. If this function raises an assertion error--for instance, because a built-in assertion failed--then the test fails as well. If no assertion failed, then the test simply returns pass. This test function can now be adapted again and again for arbitrary functions, as long as they have an assertion or some other runtime check that makes them fail. Here is the remove_html_markup function from the previous lessons, and it has an assertion to make sure that the result no longer contains any HTML markup. This is precisely what we can test in here: if the assertion fails, the test will fail, and if the assertion does not fail, then the test passes. Remember the notorious input from the earlier lessons--an HTML string enclosed in double quotes--which remove_html_markup would not handle at all. Indeed, this input makes the assertion fail, because the output still contains HTML markup. Now let's go and use delta debugging to minimize this input such that every character in the input is relevant for producing the error. For remove_html_markup, what is the minimal substring of this string, including the double quotes, that triggers the failure? As a hint, it's just two characters long. Over to you.
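
As a sketch, the generic testing function together with the buggy remove_html_markup from the earlier lessons might look like this; the exact assertion used in the lecture may differ.

    def remove_html_markup(s):
        # The version from the earlier lessons; note the deliberate
        # operator precedence bug in the quote condition below.
        tag = False
        quote = False
        out = ""
        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif c == '"' or c == "'" and tag:
                quote = not quote
            elif not tag:
                out = out + c
        # Runtime check: the output must not contain markup anymore
        assert '<' not in out and '>' not in out
        return out

    def test(s):
        # Generic test: FAIL if the function under test raises an
        # AssertionError, PASS otherwise.
        try:
            remove_html_markup(s)
        except AssertionError:
            return "FAIL"
        return "PASS"

    print(test('foo'))      # PASS
    print(test('"<b>"'))    # FAIL - the assertion fires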

19 s Generic Tests

And now the answer to our quiz. We can find this out by running delta debugging. Here we are again in our editor, and now we simply invoke ddmin on the HTML input to find out which part of the input actually causes the program to fail. We press Run, and here we go. It takes 10 tests until delta debugging has simplified the input to just two characters: a double quote and a less-than sign. These two characters already suffice to trigger the failure. This simplified input dramatically reduces the number of steps it takes us to go through the remove_html_markup function. We have a quote, so the quote flag becomes set--this is the error. Since quote is set, tag does not become true, and the less-than sign gets added to the result, which causes the assertion to fail. Just two characters, fewer steps, smaller state--all of these are advantages of a simplified input. And just for the record, the correct answer is double quote and less-than sign.

20 l Fuzz Inputs

Automatic simplification works especially well in situations where the tests are also generated automatically. If you have taken a testing class recently, you may have heard about fuzz testing. Fuzz testing is a technique where you generate random inputs for a program or an API. It's a common testing method to see whether something breaks. To show you how fuzz testing works, let's first write a fuzzer, which is a function that generates such a random input. So here's our fuzzer function. We want a function that generates strings of up to 1023 characters, filled with regular printable ASCII symbols. First we determine the string length we'd like to have; for that, we simply take a random number--random.random() returns a random number between 0 and 1--and multiply this by 1024, so we get a length of up to 1023 characters. And here we have a simple loop over the string length. We generate a character which takes an arbitrary value between 32 and 128 by again using the random function, multiplied by 96 and adding 32. We add the character to the out string and return the out string at the end. Let us see what the fuzzer produces. Press Run, and here we have a string filled with random characters. If we run it again, we're going to get another random string, and here we go. Think about what happens if you take such a random string and paste it into a function that accepts, say, a filename or an email address or a URL. Do all of these functions work well even when confronted with seemingly totally random input? In the age of the Internet, any input that is under the control of third parties can easily look like this, and fuzz testing is indeed one of the attack vectors of people wanting to penetrate your system.
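
Following this description, the fuzzer might look like this:

    import random

    def fuzzer():
        # Random string length between 0 and 1023 characters
        string_length = int(random.random() * 1024)
        out = ""
        for i in range(string_length):
            # Character codes from 32 up to (but not including) 128
            out += chr(int(random.random() * 96 + 32))
        return out

    print(fuzzer())   # a fresh random string on every call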

21 p Mystery Test

Now, let me introduce you to the mystery_test function. This is a function which returns either pass or fail, and we would like to use fuzzer input in order to make it fail. Here's some fuzzer input, which we can pass to mystery_test. Let's see whether it passes or fails on that. Well, for this input, it passes. Here, however, is another fuzzer-generated input, for which mystery_test fails. The first input passes, but the second input fails. So you wonder: what is it in this input that makes the function fail? How can you find this out? Straightforward--by using delta debugging to simplify the failing input, such that you know which part of it actually causes mystery_test to fail. And here is the programming exercise: find the minimal input that causes mystery_test to fail. For this, use a fuzzer to find an input that causes mystery_test to fail, and once you have found such an input through fuzzing, use delta debugging to minimize it. Over to you.

21 s Mystery Test

Here we are again. We have the mystery_test function--and what is it in the input that causes mystery_test to fail? Well, we can again use delta debugging to figure this out. Here is our implementation of delta debugging. I'm going to extend it such that the test now comes in as a parameter. Then we can invoke ddmin with whatever test we'd like. And I'm going to invoke ddmin with the input for which we already know that it causes mystery_test to fail, passing mystery_test as a parameter. Let's see what's in here that causes mystery_test to fail. We press Run, and we see a single character: a single dot suffices. A single dot suffices to cause mystery_test to fail. So what we have here is a classical setup. First, we use a fuzz tester to test a program, and we repeat and repeat this until the program breaks. Then we feed the failing input into delta debugging, which again runs the program again and again until we get a minimal failing input, such as a single dot. Such minimal failing inputs are very important when it comes to convincing other people to fix a bug. Suppose the program you're testing is a huge SQL server. Suppose you use a fuzz tester to generate extremely long and complex SQL queries that you send to the SQL server. At some point, the SQL server is going to choke on one of these super, super complex inputs, and then you say to the SQL developers, "Hey, I have this very, very long and complicated SQL query, and I can use this to break the server." But the developers are going to tell you, "Ha, a big query like this one never ever happens in practice. So maybe this will crash our server, but it's really not the top priority right now. We'd rather care about real queries, thank you very much." So what you do then is run this big, big complicated query through delta debugging, which will give you some minimal failing input--which will be way, way smaller and actually look like something so simple that it can actually happen in practice. And all of a sudden, the SQL developers will take you very seriously, because all of a sudden this looks like something that could happen in practice. True story from Microsoft.
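
Here is a sketch of the whole setup: ddmin extended to take the test as a parameter, plus the fuzz-then-minimize driver. It assumes mystery_test and fuzzer are defined as earlier in the lesson.

    def ddmin(s, test):
        # Delta debugging as before, but with the test as a parameter,
        # so it works for arbitrary tests and arbitrary inputs.
        assert test(s) == "FAIL"
        n = 2
        while len(s) >= 2:
            subset_length = len(s) // n
            start = 0
            some_complement_is_failing = False
            while start < len(s):
                complement = s[:start] + s[start + subset_length:]
                if test(complement) == "FAIL":
                    s = complement
                    n = max(n - 1, 2)
                    some_complement_is_failing = True
                    break
                start += subset_length
            if not some_complement_is_failing:
                if n == len(s):
                    break
                n = min(n * 2, len(s))
        return s

    # Fuzz until mystery_test breaks, then minimize the failing input.
    failing_input = fuzzer()
    while mystery_test(failing_input) == "PASS":
        failing_input = fuzzer()
    print(repr(ddmin(failing_input, mystery_test)))   # e.g. '.'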

22 l Yesterday it Worked

Delta debugging can be used to simplify other domains, not just input. For instance, it can also be used to simplify changes to your code. When would such a thing be needed? Well, here's a simple situation. This is you, and this is your major web hosting service. You're building your infrastructure on lots and lots of third-party open-source libraries, and since you want to keep your service up to date with respect to security patches and everything, you have to update these libraries at regular intervals. So you're getting patches--that is, large changes to a code base--at regular intervals. This is the same as upgrading the libraries to the newest version, and the changes between versions, that is, the patches, can easily contain tens of thousands of source lines. Here comes a new patch, patch 3, also with tens of thousands of source code lines. All of a sudden, your infrastructure fails after applying the patch. What is the cause of the error? This is an instance of a problem that's more generally called: yesterday, my program worked; today, it does not--why? You changed something, and as a result of the change, something no longer works, and you want to figure out what it is in that change that makes the program go wrong. Again, we can use delta debugging to simplify this. Assume the patch that causes the failure can be broken down into several subpatches. For instance, these tens of thousands of changed lines would affect different locations of the program, and each of these locations means a separate place to patch, or a different subpatch. The idea would now be that maybe a subset of these subpatches already suffices to create the error. In other words, we would simplify the patch to find a minimal subset of the patch that creates the failure.

23 l Blame O Meter

Let us simulate this situation in our IDE. Let us assume the individual subpatches are all identified by numbers from 1 to 8. Again, we write a testing function that simulates the behavior of the system. We assume that if patches 3 and 6 are applied, then the failure occurs. If patches 3 and 6 are not applied together, then the failure does not occur. Thanks to the beauty of Python, all of the infrastructure we had previously set up for strings also works for lists. Let's see what happens if we test the entire set of patches. We see it fails, which is not much of a surprise, because 3 and 6 are both in here. If we test the empty list instead--that is, no patches applied--then the test passes. Again, we print out the list we're testing, its length, and the result. Now we can again invoke delta debugging with the set of patches and the above test function. Let's see how this works. We invoke ddmin to see whether we can simplify this list of patches and find out which of these patches are relevant. Press Run; here we get the result, and the final result is: patches 3 and 6 suffice to cause the failure. Well, that's something we knew all along, didn't we? But if your original patch had been tens of thousands of lines long, then chances are that delta debugging reduces it to just two lines, or another very small set of locations in your code, which you didn't know about before. But if you change these two lines, the failure will occur, and if you don't change them, the failure will not occur. This approach for determining the culprit for a regression has also been named the blame-o-meter--a means to know whom to blame. Here, "whom to blame" means which places in the code are to blame, but it can also mean which patches are to blame, and even which programmers made the changes that are now to blame. The idea is that such a scheme can be used in any situation in which an old version works and a new version does not. So whenever you have a regression test failing--and a regression test does exactly that, comparing the results of the new version against the old version--you can and possibly should run delta debugging in order to figure out which lines in the code are actually responsible for the failure.
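
The simulated patch test and the ddmin invocation might look like this sketch, using the parameterized ddmin from the mystery test exercise:

    def test(patches):
        # Simulated system: the failure occurs exactly when
        # patches 3 and 6 are both applied.
        print(patches, len(patches))
        if 3 in patches and 6 in patches:
            return "FAIL"
        return "PASS"

    patches = [1, 2, 3, 4, 5, 6, 7, 8]
    print(test(patches))          # FAIL - 3 and 6 are both applied
    print(test([]))               # PASS - no patches applied

    # Python slicing and concatenation work on lists just as on
    # strings, so ddmin applies unchanged:
    print(ddmin(patches, test))   # -> [3, 6]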

24 q Is It Fixed

The only problem with this application of delta debugging is that the test function is somewhat elaborate. It first must apply the patches to the code, then build the code, then run the test--and this again and again and again. This implies that your build process must be automated, and of course, your version control system should also be able to produce exact and small differences between versions. There's even a version control system where such a scheme is already built in. It is called git, and the command git bisect will give you the exact change between two versions stored in git, such that the old version does not have the failure and the new version does. So this does something very similar to delta debugging: pointing out the culprit that has been changed such that the failure occurs. And now for a quiz. Assume that delta debugging gives you a failure-inducing change, and you now go and undo the change--that is, revert to the previous version for the locations returned by delta debugging. What is the effect? Is it that the program builds normally, that the failure no longer occurs, or that the problem is properly fixed? Check all that apply.
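
For reference, a typical git bisect session looks roughly like this; the version tag v1.0 is just a placeholder for any known-good revision:

    $ git bisect start
    $ git bisect bad              # the current version fails
    $ git bisect good v1.0        # this older version is known to work
    # git now checks out a revision halfway in between;
    # build and test it, then report the result:
    $ git bisect good             # ...or "git bisect bad", as appropriate
    # repeat until git announces the first bad commit
    $ git bisect reset            # return to where you started

Like delta debugging in the case where one half always fails, this takes a logarithmic number of tests.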

24 s Is It Fixed

Now for the answer. If I undo the failure-inducing change as returned by delta debugging, can I actually build the program? Yes, I can, because this is actually tested by delta debugging--the reverted version makes the test pass--so the program builds normally, and the failure no longer occurs either. Whether the problem is properly fixed, though, is another thing. I may have fixed the failure in question, but by undoing these changes, I may have introduced new bugs. In particular, these changes were probably made with some specific feature in mind, and the very least that will happen is that I'll lose the new feature that was introduced by these changes. So I can't really say that the problem is properly fixed. I need to come up with a change that keeps the new feature--or at least the intended effect of the original changes--but avoids the bad effect, namely the failure. So this is the final answer.

25 l Causes and Causality

On a more philosophical level, what delta debugging returns is not only a simplified input but also a cause for the failure. What is a cause, anyway? Assume I have a window, and here comes a ball flying toward the window, and the window breaks. You would straightforwardly think that the ball is the cause of the window breaking. But why is it the cause? This is the so-called counterfactual definition of causality, which is the one most used in our context. It works as follows. We have two events, A and B, where A comes before B. Say A is the ball flying towards the window, and B is the window shattering. We now say that A causes B if B would not have occurred had A not occurred. Applied to our example, we can say that the ball causes the window to shatter if the window would not have shattered had the ball not arrived--which actually is true. With the ball, the window shatters. Without the ball, the window does not shatter. And since the ball preceded the shattering, everything is in place to say that the ball caused the window to shatter. If this sounds complicated, we haven't seen much yet. Causality is one of the most disputed subjects in philosophy. Some philosophers even say that causality does not really exist--it's just an illusion. For our purposes, however, that is, within debugging, this counterfactual definition does a good job. First we have the ball, then we have the window crashing. Or: first we have the defect, and then we have the failure. So why does delta debugging return a cause? Simply because there might be a large set of events that all could or could not contribute to the failure. The sun is shining, the clouds are in the sky, there's a nice green lawn, and whatever else--but none of this is relevant for shattering the window. Only the ball is. By simplifying the scenery to a single element that is responsible for the test failing, delta debugging returns the cause.

26 l Causes and Errors

This notion of causality is important in debugging, because when we fix a defect, we want to make sure of two things. First, we want to make sure that the defect actually is an error--that is, it is wrong. Second, we want to make sure that the defect causes the failure: if A is the defect and B is the failure, we want to make sure that if the defect hadn't been there, then the failure would not have occurred either. These are the two things we need to show when we fix a defect. In order to make sure that the defect is an error, we must show how to correct it, and to show that the defect causes the failure, we must show that the failure no longer occurs after we have changed it. It is important to show both--that the defect is an error and that it is a cause--in order to avoid situations in which a defect only is an error but does not cause the failure, or in which something causes the failure but is not an error. Let's look at these. First, you can have an error that's not a cause. Here's you, looking at your program, and you find tons and tons of errors. Here's one, here's one, and there's one, and you fix them and think you're done. But you don't know whether any of these fixes actually addresses the problem, because it may well be that none of the errors you found actually causes the failure. This simply means that after any fix, you need to verify whether your fix actually did the trick--and better yet, before the fix, you should be able to precisely predict that your change will actually fix the failure. The second category is causes that are not errors. If you think of our Mozilla example, for instance, we found that the select tag in the input causes Mozilla to crash when printing. However, this is just a cause; it is not an error. This is perfectly legal HTML input, and at the very least some input that Mozilla should normally be able to cope with--in particular, Mozilla should not crash on any input. What we have here is the cause, but we haven't found the error yet. The error is somewhere in the code, probably in the piece of code that handles the printing of select tags. So we can have errors that are not causes, and causes that are not errors. The goal in debugging is to find errors that actually cause the failure.

27 q Actual Cause

Finally, the general concept of a cause can be misleading. Think about the window shattering: we said it was the ball, and that's right. But it could also have been the window maker. If the window maker had not made this particular window, then the window would not have shattered. And if the house builder had not built this particular house with the window in it, then the window would not have shattered either. For a single effect, there can be multiple causes, and all of these are valid according to our earlier definition. This is where the concept of an actual cause comes in handy. An actual cause assumes and changes as little as possible, but still changes the effect. If we think about the causes we have discussed: a ball shattering the window does not change much except for the window. The window maker not producing the window makes a larger difference, because then we would have the house with one window missing for quite some time, which would change a lot--parts of the house would get very wet, for instance, or would be cold or hot, depending on the climate. If the house builder had not built the house, the change would be even bigger. So, what you want is an actual cause that assumes and changes as little as possible, which is precisely what the ball is in here. For debugging, the same holds: you want a cause that assumes and changes as little as possible but still produces the effect. And that's precisely what delta debugging gets you: starting with a general cause--the entire input--it produces an actual cause, namely a subset of the input in which everything is relevant for producing the effect. Here's a quiz. I invoke the GNU compiler, and it crashes. Which of these are general causes for the failure? Is it me? Because: no me, no invocation, no crash. Or is it Richard Stallman, the founder of the GNU project? Because: without Stallman, no GNU compiler, no crash. Or is it oxygen? Without oxygen, no Richard Stallman, no me, no compiler, no crash--probably also no world as we know it, no computers, no electricity. Interesting. Or, finally, is it a bug in the compiler? Because: no bug, no crash. Check all that apply.

27 s Actual Cause

Back to the answer. Well, formally speaking, all of these are causes, because if I don't occur, the crash doesn't occur. If Richard Stallman doesn't occur, the crash doesn't occur. No oxygen, no crash--and no bug, no crash either.

28 q GNU Compiler

But now for the extra question: which of these is the actual cause for the GNU compiler crashing? There is only one. Go and pick it.

28 s GNU Compiler

Again, that should not have been too hard. The answer is, of course, a bug in the compiler. We can check all of these possible causes by the changes they imply. No me would have some effect--well, you wouldn't be able to listen to this course right now. No Richard Stallman would have a huge impact; the GNU project is the source of so much great open-source software out there. We would have a very different world. Oxygen is even more important--I can't even begin to imagine what our world would look like without it. Of all these causes, fixing a bug in the compiler has by far the smallest set of assumptions and also the smallest effect, because apart from fixing the bug and thereby the crash, nothing else changes. Therefore, this is the actual cause.

29 q Leveraging Input Structure

Before we get to today's homework, let us see how we can make delta debugging faster. We must realize that delta debugging is a rather dumb algorithm, where dumb means it's rather economical in terms of assumptions. It assumes nothing about the input structure, which requires little, but on the other hand, it gets the job done. The interesting thing is that the more knowledge about the structure of the input you put into delta debugging, the more efficient delta debugging becomes. Let me illustrate this with an example. When faced with a 20-character input, delta debugging has to split it into two halves in the beginning. But since delta debugging doesn't know about the structure, the division will be right in the middle of the input. If delta debugging knew, however, that the input was HTML, then it could come up with a split that occurs at the boundaries between HTML tokens and actual text--something like this, for example. And if you repeat this for the first substring, you will see that we are able to simplify the input much faster, because now our simplification follows the structure of the input. Let me illustrate this with our example. Here again we have delta debugging as is, and we're applying it on this very input, as we just saw. When I press Run, we can now see the individual runs. In the end, one character after another is being removed, until we finally see, after 29 tests and one assertion, the final result: the simplified input <select>. How can we make delta debugging aware of the input structure? We have already seen that delta debugging works on lists as well as on character strings. So what we could do is split the string into individual substrings according to the input structure and have delta debugging work on that list of elements instead. Here is such a list of elements: first, we have the select tag <select>, then foo, and then the closing </select>. Now, what we have to do is adjust our test function such that it merges the individual parts of the list back into a string, which we do up here. We're using the Python join function, which takes all the elements in the list and concatenates them with the first string as separator--in our case, an empty string. So the effect of this is that all the elements in the list s will be merged together, and then, in the reconstructed string, we again search for the select tag. In our initial setting, it took us 30 tests in total until we minimized the input; that is, on characters, we have seen it took ddmin 29 tests, excluding the assertion, to simplify the input. How many tests will it take on this list of tokens? Here, a token is a substring with a meaning of its own. You can try this out for yourself or just estimate. Is it 4 tests? Is it 6 tests? Is it 12 tests? Or is it 29 tests? Over to you.
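
The token-level setup might look like this sketch, again using the parameterized ddmin:

    import re

    def test(tokens):
        # Merge the list of tokens back into a single string,
        # then check for a select tag as before.
        s = "".join(tokens)
        if re.search("<select[^>]*>", s):
            return "FAIL"
        return "PASS"

    tokens = ["<select>", "foo", "</select>"]
    print(ddmin(tokens, test))   # -> ['<select>']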

29 s Leveraging Input Structure

To answer this question, all we need to do is press Run, and you'll see that now we only require four tests. You see how delta debugging splits the list, quickly finds a sublist of just two elements, checks each of them, and finds the one element that causes the test to fail. The correct answer here is 4.