What we're going to talk about are more advanced issues in random testing. We're going to talk about things like test oracles; this is always a big issue when we are using random testing. We're going to talk about how random testing fits into the bigger picture, that is to say, how ideally a random tester that you develop will co-evolve with the software it is testing, so that they can both strengthen each other over time. We're also going to talk about perhaps the trickiest topic of all in random testing, which is tweaking and tuning the rules and probabilities involved in making a random test case so that they actually do a good job. Together, these issues and some others that we're going to cover constitute what I would think of as advanced random testing issues, and they're going to give you a basis for creating really strong random test case generators.
So I'd like to talk a little bit about how random testing fits into the larger software development process. I'll start by asking a question: why is it that random testing works at all? I don't think the answer to this question is actually known in any kind of a detailed sense, but we do know parts of the answer. A big part is that random testing is based on very weak hypotheses about where bugs live. What I mean by this is, every test case that we run is a little experiment. The input into this experiment is a test case, and the outcome of the experiment is a bit of information about whether the system worked or didn't work. What random testing does is make the conditions under which we're running the experiment weaker. We don't have to guess about some particular thing that might fail; rather, we're guessing about a whole class of things that might possibly fail. This turns out to be powerful because, given the very complex behavior of modern software, people don't seem to be very good at forming hypotheses about where bugs lie. The second reason is that people tend to make the same mistakes while they're coding and while they're testing. So, for example, if I forget to take into account some special case while I'm implementing the system, I'll probably also forget to test it. If I forget to implement some feature, or I misimplement some feature, I'm not going to test the stuff that I did wrong, because if I had been able to think of the test case I probably would have gotten the feature right in the first place. Random testing to some extent gets us out of this problem because it can construct test cases that test things we don't actually understand or know very well.
The third reason is that there's a gigantic asymmetry between how fast computers are and how fast people are. Even if the random tester is mostly generating stupid test cases, if it can generate a clever test case maybe one in a million times just by getting lucky, then it still might be a more effective use of our testing resources than writing test cases by hand. There has been a similar finding in other areas. For example, the game Go has an extremely large state space, and it was traditionally thought that computer Go players were just never going to be very good; the game is just too hard. What turned out to be the case is that today's computer Go players are quite a bit stronger than previous ones, and they're based on what are called Monte Carlo methods, which are simply randomized methods for exploring small parts of the Go state space. It turns out that this is often good enough to create respectably strong players. They're still not nearly as good as the best human players, but they do pretty well. This exploits a very similar insight to random testing: we probabilistically explore spaces that are very large, and even if we can't achieve any kind of meaningful coverage, we can still often find interesting things, especially given the extremely fast 8-core machines that are really quite cheap today.
Now I'd like to ask a slightly different question which is why random testing is so incredibly effective on commercial software, at least sometimes. But what I don't want to do here is start some sort of a debate about whether random testing is or isn't effective, I think there's pretty ample evidence. So for example, the fuzzer papers that we discussed yesterday or the talk that you can watch online "Babysitting an army of monkeys". People have shown that random testing really is effective on commercial software. Why is that? I'm going to give some opinions and of course you should feel free to disagree with these.
And the first reason I think is because the developers of the commercial
software systems aren't doing this, or at least not doing it enough.
I think it's pretty clear that the kind of bugs found, for example, in the fuzzing papers or in Charlie Miller's talk wouldn't have been there in the software if the developers of Adobe Acrobat or of the Unix utilities had had a reasonably aggressive random testing program. And remember that some of the bugs Charlie Miller was talking about were security vulnerabilities: things that they really don't want in their software. So what I would argue is that software development efforts that don't make proper use of random testing are flawed, because modern software systems are so large and so complicated that test cases produced by non-random processes are unlikely to find the kind of bugs that are lurking in these systems. And so what that leaves us with is a question: what should they have done? How is random testing supposed to work? Let me give some ideas about that. Here's sort of a rough software development timeline: we're releasing software here, and the early development stage is over here. What we've looked at mainly so far in this course is random unit testing. We're developing software units, and we're trying to make sure that they're robust enough that when we start composing them together later they'll be a solid foundation. We've looked at several cases, for example the bounded queue, and looked at fuzzing the interface it provides. We also looked at random fault injection, and that was for the read_all function: a function that was supposed to cope with the fact that the Unix read system call can have partial success. There we were doing fault injection, so we were fuzzing the interface used by the read_all call, not the interface that it provides.
What we want to do is ensure, as we're developing the modules, that we're creating robust pieces of software whose interfaces we understand, so they're going to be a solid foundation for future work. As we start developing more elaborate software stacks, it's going to be the case that some of our random testers become useless. So, for example, if we have a queue instantiated here that's used by some more sophisticated piece of software, we're no longer interested in the ability to randomly test the interface provided by the queue, because it's simply being used by the rest of the software. On the other hand, other kinds of random testers, such as those that come in at the top level and those that perform system-level fault injection, are absolutely still useful. In fact, injection of things like erroneous responses to system calls is a really important way to test a larger piece of software, because typically these kinds of errors can result in failures propagating all the way through our software stack, and we'd really like to understand how our system responds to that sort of thing. As we approach something that's more of a complete product, our focus should be on the external interfaces it provides. These are going to be things like file I/O and the graphical user interface. If you recall, those fuzzing papers were fuzzing exactly these sorts of things: they were delivering random bits through the file interfaces, and they were delivering random GUI events to applications and knocking over a pretty large proportion of the applications they tested.
So what we want is to build these system-level random testers as early as possible in the development process. There are a number of reasons for this. First of all, what we'd like to do is start off with a simple version of our system that doesn't implement very much functionality, and then test it with sort of a weak fuzzer, that is to say, with values that are maybe not that interesting. The good thing about this combination is that these weak random tests are probably going to find some flaws, but what they're not going to do is flood the developers with huge numbers of bugs, like what might happen if we used an extremely strong random tester. There is no easier way to demoralize software developers than to hand them a really big pile of bugs. Nobody wants that. What they're going to do is ignore those bugs and get back to getting work done. What you want to do is give people a slow but steady stream of important bug reports and let them fix these as they go. There is another reason why it's nice to give people a continuous stream of bug reports, which is that they help show us flaws in our software: they show us interfaces that we don't understand, they show us modules that end up being extremely weak for one reason or another, and they basically help us better understand where our software development effort is going wrong. If, instead of giving people maybe a couple of bug reports a week for a year, we give them a hundred bug reports at the very end, all they're going to do is triage to find the five most critical bugs and fix them using hacks. Nobody learns anything, everybody's angry, nobody's happy. What we would rather have done is to have been doing random testing all along and using it to spot weaknesses in our software. The other thing that happens is that, as our software evolves to be more robust, as we move towards releasing it, we're evolving our random tester to be stronger and stronger.
Maybe this week we added a feature where we generate a new kind of random input that we haven't generated before, and now this is going to generate some bug reports, and we'll fix them, and our software evolves to be more robust. If we keep doing this not just over weeks but over years, what we'll end up with is a random tester and a system that have gone through this co-evolution process where they've both become much stronger. We've evolved an extremely sophisticated random tester, and we've also evolved the system to be robust with respect to the kind of faults that can be triggered by the random tester. So I firmly believe, although I can't prove this and you're free to disagree, that if, for example, Microsoft, Adobe, and the other companies that end up with lots of security vulnerabilities had done this from the beginning and had fuzzed their products all the way through the development chain, they'd end up with far fewer of these nasty crashes and critical security vulnerabilities that they're always scrambling to patch and that anybody with a fuzzer seems to be able to find without a whole lot of effort. Or at least that's been the case in the past, and it's possible that now, with more widespread use of more aggressive fuzzers, that era of easy security bugs in popular products is hopefully starting to tail off.
So we just talked about random testing in its larger context as part of the software process, but what I'd like to talk about now is one of the more advanced and frankly difficult parts of building a really good random test case generator. As I was saying a little while ago, the best way to start random testing is to start early in the software process and start simple. And what's going to come out of a simple random test case generator is often a collection of testing results that are not particularly great. We get shallow coverage, we violate many constraints on valid inputs, and overall it doesn't work all that well. It probably doesn't find all that many bugs, unless we're testing a particularly weak system like, for example, Unix command-line utilities from around the 1990s. So the thing to do with these results is to look hard at the test cases. Think hard about how they relate to the overall shape of the input domain for the system. For example, it may be that we're generating a superset of the input domain, that is to say, we're generating invalid inputs; but it's also likely that in many cases we're generating a subset of the domain, that is to say, test cases that fail to explore some parts of the behavior space of the software under test. That's going to miss bugs. We should definitely look at the effects the random test cases have on code coverage. This is extremely revealing because, as we've seen, it's really easy to write a random test case generator that gets very shallow coverage of the software under test. The next step is to adjust the rules that are in effect for generating random test cases and to tweak the probabilities. If that sounds very vague at this point, that's good, because it is vague. What I'm going to do in a little bit is go into some specific examples. Of course we can always add functionality to our random tester.
And as I was saying earlier, it's often good to do this in a very slow and incremental fashion so we avoid overloading people with bugs that they don't have time to fix. This is an iterative process that doesn't stop as long as we're using a random tester. We should always keep looking at it and trying to evaluate how we can better adjust the random tester to find interesting things out about the software we're testing. I don't mean to say here that this has to take a lot of time. Something you can do is, every couple of weeks or whenever you have time, just take a look at the test cases. Take a look at the coverage. Think about things a little bit. See if anything has changed in the software under test that might make you want to tweak the random tester, and then basically you just put it back into production.
So what I'd like to do now is look at a few examples. The first one is filesystem testing. We've already talked about this a little bit, so I'm not going to re-introduce the subject. If we start with a simple filesystem tester, what we're probably going to do is make a list of all the filesystem calls we would like to test. So we can mount and unmount a filesystem, we can open and close a file, we can create a directory, remove a directory; make a list of all those kinds of functions and basically call them randomly with random arguments. We'll look at the results and think about them, and what we'll see is that our testing is probably highly sub-optimal. For example, if we're throwing unmount calls randomly into the mix, we're going to end up, on average, operating on an unmounted filesystem fifty percent of the time. Similarly, if there's no correlation in randomly chosen file names between different calls to open, close, read, write, and so on, we may effectively never actually perform read or write calls on open files. This is obviously undesirable. So the first thing we're going to do is special-case mount and unmount. We're also going to need a special case for open and close: one thing to do is keep track, in the random tester, of the set of currently open files. Just this couple of simple things is probably enough to get a filesystem fuzzer off the ground. But there's more. We're going to limit the size of files so we don't waste a lot of time reading and writing many gigabytes to files when we could be using that time better. Perhaps we also want to limit the height of the directory hierarchy so we don't generate incredibly deep directory hierarchies that are not interesting. On the other hand, we may want to do exactly that: we might want to specifically test extremely deep directory hierarchies or extremely large files, but it might be the case that these are special cases that we test separately from the main body of our fuzzer.
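To make this concrete, here's a minimal sketch of what such a tester might look like. This is my own illustration, not the course's code: the name fs_fuzz, the scratch-directory approach, and the operation mix are all assumptions, and mount/unmount are left out entirely since really exercising them needs privileges and would be special-cased outside this loop anyway.

```python
import os
import random
import shutil
import tempfile

def fs_fuzz(num_ops=1000, seed=0):
    """Random open/close/read/write/mkdir calls against a scratch
    directory, tracking the set of currently open files so that reads
    and writes actually land on open files."""
    rng = random.Random(seed)
    root = tempfile.mkdtemp()
    open_files = []                      # currently open file objects
    names = [os.path.join(root, "f%d" % i) for i in range(8)]
    counts = dict.fromkeys(["open", "close", "write", "read", "mkdir"], 0)
    try:
        for _ in range(num_ops):
            op = rng.choice(sorted(counts))
            if op == "open":
                open_files.append(open(rng.choice(names), "a+b"))
            elif op == "close" and open_files:
                open_files.pop(rng.randrange(len(open_files))).close()
            elif op == "write" and open_files:
                # cap write sizes so we don't burn the test budget on huge files
                rng.choice(open_files).write(os.urandom(rng.randrange(64)))
            elif op == "read" and open_files:
                f = rng.choice(open_files)
                f.seek(0)
                f.read(64)
            elif op == "mkdir":
                # keep the hierarchy shallow: only create under the root
                os.makedirs(os.path.join(root, "d%d" % rng.randrange(8)),
                            exist_ok=True)
            counts[op] += 1               # counts attempted operations
        return counts
    finally:
        for f in open_files:
            f.close()
        shutil.rmtree(root)
```

With the open-file tracking in place, read and write calls actually hit open files instead of failing almost every time, which is exactly the kind of special-casing the transcript describes.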
So I hope the general pattern is clear here. We start with something really simple, we probably observe this doesn't work very well, and then we start special casing things in order to remove limitations of our random tester over time so maybe over weeks or months or even years this process ends up with a random tester that will be extremely strong.
Let's look at another example. This time we'll be quite a bit more specific. We already wrote a random tester for the little bounded-length queue data structure. One thing we might want to ask ourselves is: does that fuzzer do a good job? If you wrote a tester that found all five of the bugs that I seeded in the queue data structure, then you probably did a pretty good job. But let's take a look again. We're going to look at the queue as a finite state machine, and we're going to see what kind of states we can drive that machine into using the random tester. We're going to start off with an empty queue; fifty percent of the time at this point we're going to make a dequeue call, which is going to fail, and fifty percent of the time we're going to make an enqueue call, leaving a queue containing one element. From here it should be pretty obvious what can happen: we can dequeue something, going back to the empty state, or enqueue something, going to the two-element state. Let's assume that that's full. So the dynamic process we're going to get when we run the tester is some sort of random walk around this finite state machine. What we want to ask ourselves is: does this random walk have a reasonable probability of executing all the cases? Here, probably the most interesting cases are dequeueing from an empty queue and enqueueing to a full queue, along with walking around the rest of the state space. What I've done is rigged the random tester with some extra statistics-keeping so it can tell us about this a little bit.
We saw this a little earlier, but I was trying not to call attention to it. What happens is, when we add something to a queue which is not full, we increment a counter. Similarly, we increment another counter when we add something to a full queue. We also count when we remove from an empty queue and from a non-empty queue. So let's run the random tester. Here's what it tells us: it did about 33,000 adds to a non-full queue, about 16,000 adds to a full queue, about 33,000 removes from a non-empty queue, and about 16,000 removes from an empty queue. So we can see that, at least as far as that really crude coverage metric goes (and of course we could have gotten similar information using a branch coverage tool, but this is a little easier and more general-purpose), we've gotten good coverage of our queue. But let's try something different. Imagine that instead of the state machine having just three states, we had a much larger queue, say one containing a thousand elements. The dynamics are going to be exactly the same, except I'm leaving out a lot of nodes in the middle. We're still going to start at the empty state, and we're still going to randomly walk around our finite state machine. My question is: are we going to get good behavioral coverage of our queue in this case? Let's take a look.
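Here's a sketch of a tester instrumented with those four counters. The Queue class is a stand-in I wrote for illustration; it is not the course's actual (deliberately buggy) queue, and the counter names are my own:

```python
import random

class Queue:
    """Simple bounded FIFO queue, standing in for the queue under test."""
    def __init__(self, n):
        self.n = n
        self.items = []
    def enqueue(self, x):
        if len(self.items) == self.n:
            return False          # full: the call fails
        self.items.append(x)
        return True
    def dequeue(self):
        if not self.items:
            return None           # empty: the call fails
        return self.items.pop(0)

def fuzz(n, num_tests=100_000, seed=0):
    """50/50 random walk over the queue's state machine, counting the
    four cases: add to non-full, add to full, remove from non-empty,
    remove from empty."""
    rng = random.Random(seed)
    q = Queue(n)
    add_nonfull = add_full = rem_nonempty = rem_empty = 0
    for _ in range(num_tests):
        if rng.random() < 0.5:
            if q.enqueue(rng.randrange(1000)):
                add_nonfull += 1
            else:
                add_full += 1
        else:
            if q.dequeue() is None:
                rem_empty += 1
            else:
                rem_nonempty += 1
    return add_nonfull, add_full, rem_nonempty, rem_empty
```

For a two-element queue, all four counters come out large, matching the numbers quoted above; for a thousand-element queue, the add-to-full counter stays at zero.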
Let's look at a very similar case where, again, we visualize the execution of the queue as a finite state machine. Now we're going to have a much larger queue: one that stores a thousand elements instead of two. So we're going to have exactly the same kind of state machine, but its shape has changed: instead of 3 states it has 1001. When we randomly test this queue we're still going to start at the empty state, but let's look at what the dynamics are this time. We have a queue with N=1000, and nothing else has changed. Run the random tester. This time we've done around 50,000 adds to a non-full queue, no adds to a full queue, almost 50,000 removes from a non-empty queue, and 10 removes from an empty queue. So let's try to understand what happened here. I'm going to draw another view of that state machine: here we have the empty queue, here we have the full queue, and here we have the number of times we visited a particular state. What's happening is that with fifty percent probability we're randomly walking backward or forward, and we get a situation where the probability of visiting states farther away from where we started drops off exponentially. Actually, there are good closed-form equations for computing the probability of getting to any particular point. But as we saw, using a hundred thousand tests, I believe we never actually managed to make it to a thousand items in our queue, although we perfectly easily made it to a queue size of two. A thousand isn't a particularly large size for a data structure; we could easily have a queue that holds ten thousand or a hundred thousand elements, and the chances of randomly walking all the way out to the end become even more negligible. So the question we have to ask ourselves is: what do we have to do differently to the random tester to make sure it tests this situation (the 1000-length queue) as thoroughly as it tested this situation (the 2-length queue)?
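As a back-of-the-envelope check on that drop-off, the reflection principle plus a normal approximation gives a rough estimate of how likely an unbiased walk is to ever reach a given depth. This is my own approximation for intuition, not a formula from the course, and it ignores the reflecting barrier at the empty state:

```python
import math

def prob_ever_reach(d, t):
    """Rough probability that an unbiased +/-1 random walk of t steps
    ever reaches distance d from its start: by the reflection principle
    this is about twice the tail probability P(S_t >= d), which we
    approximate with a normal distribution of standard deviation sqrt(t)."""
    z = d / math.sqrt(t)
    tail = 0.5 * math.erfc(z / math.sqrt(2))   # approx. P(S_t >= d)
    return min(1.0, 2 * tail)

print(prob_ever_reach(2, 100_000))      # depth 2: essentially certain
print(prob_ever_reach(1000, 100_000))   # depth 1000: well under one percent
```

This matches the observed behavior: in a hundred thousand 50/50 steps, reaching depth 2 is a sure thing, while reaching depth 1000 almost never happens.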
There's no cut-and-dried answer to this kind of question; these questions are definitely hard in practice. Unless something changes about the software under test, we might have to adjust the probabilities to compensate. Let's look at one possible solution: bias the probabilities towards enqueuing. If the random number generator returns a number less than 0.6 we're going to enqueue, and the other forty percent of the time dequeue. So let's see what effect that has. This time we attempted to add to a full queue a large number of times (~19,000), but we seem to have not particularly well tested the case of removing from an empty queue (1 time). Let's try running a couple more times and see what happens. This time we didn't remove from an empty queue at all, so we've biased the probabilities too far towards adding. What's going to happen in any kind of random walk is that, as long as the probabilities are respectably balanced, it takes a long time to get anywhere, unless we unbalance them significantly, like we did with the 60/40 split. What I'd probably do in this case is make a configurable bias in the probability and add a completely new outer random testing loop. I initialize my bias variable to be something fairly large when the outer loop counter x is even, and set it to something smaller when x is odd. So we've taken our testing loop and enclosed it in a larger random testing loop, and in the larger loop we make a qualitative change to one of the probabilities: that is to say, we bias execution towards one of our API calls over the other, and on odd-numbered iterations we bias it the other way. What we're hoping to do now is create a situation where we start off in the empty state, migrate with pretty high probability towards full and bounce around there, and then, in the next configuration of the testing run, walk back towards empty; and we do that twenty times.
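A sketch of that outer loop might look like the following. The 0.75/0.25 bias values and the modeling of the queue as a bare length counter are my simplifications, not the course's exact code:

```python
import random

def fuzz_biased(n=1000, rounds=20, tests_per_round=10_000, seed=0):
    """Outer-loop fuzzer sketch: alternate the enqueue bias between
    rounds so the random walk migrates out to the full state, bounces
    around there, and then walks back to empty. The queue is modeled
    as just its length."""
    rng = random.Random(seed)
    length = 0
    add_full = rem_empty = 0
    for x in range(rounds):
        # even rounds push toward full, odd rounds push toward empty
        bias = 0.75 if x % 2 == 0 else 0.25
        for _ in range(tests_per_round):
            if rng.random() < bias:
                if length == n:
                    add_full += 1        # enqueue on a full queue
                else:
                    length += 1
            else:
                if length == 0:
                    rem_empty += 1       # dequeue on an empty queue
                else:
                    length -= 1
    return add_full, rem_empty
```

With the alternating bias, both hard-to-reach corner cases (adding to a full thousand-element queue and removing from an empty one) get exercised thousands of times in a single run.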
So what we hope is that this will let us thoroughly test even a fairly large queue. We'll see if this actually happens. Of course, it's going to take longer to run this time.
So with this bouncing back and forth behavior, we've done a good job of testing a large queue. If you think about it, this idea of adding a new outer testing loop to a random tester is often a good one. So this is probably exactly what we would end up doing for the file system example. If we special case mount and unmount, what that's going to mean is, hard-coding a call to mount at the beginning, then executing a bunch of API calls (opens, closes, mkdirs, etc.), and at the end we hardcode an unmount call. What that fails to do is interesting stress testing of the mount and unmount calls. If we include mount and unmount in our random API calls that will be too many calls to those functions, but if we hardcode only one at the beginning and one at the end, that's probably too few. What we can do is mount the filesystem, do a bunch of stuff and unmount it. Then enclose that in an outer testing loop to make sure that we stress all the parts of the system that we intend to stress. While the state machine for the filesystem is more complicated than the queue, we can still do something similar and we'll find that we need to adjust our probabilities. Since this is a deep topic and a difficult one, let's look at one more example.
We're going to look at testing bitwise functions. We want to write two Python functions, high_common_bits and low_common_bits, that take two integers a and b of the same size as inputs, say 32-bit or 64-bit integers, and return an integer with the same bit width. For high_common_bits, the return value consists of the bits, starting from the most significant position, that a and b have in common, followed by a single set bit at the first position where they differ, with all of the remaining lower bits zeroed.
So in the example here, the two highest bits of a and b are the same, so we set those same values in the output. The third bit is different, so we set that bit in the output and fill the remaining lower bits with zeros. The low_common_bits function does the same thing, except it starts from the lowest bit and moves left. These examples might seem silly, contrived by a professor to mess with students, but that's not the case: they come from an optimized trie implementation, which is a balanced ordered tree similar to the splay tree we looked at, but one that uses bitwise operations to find the descendants of a node in order to get high performance and a low memory footprint.
So that was just introducing these two functions, and the deal is that it's really easy to write reference code for them. For high_common_bits, all we're going to do is walk through the bits of the inputs; here I'm assuming 64-bit integers, so we walk through them in reverse, from bit 63 down to bit 0. Every time we see two bits that are the same in a and b, we copy that bit to the output. When we see two bits at the same position that are different in a and b, we set that bit position in the output and return it. low_common_bits is just the same; the only thing that's different is we move from bit 0 to bit 63 instead of 63 down to 0, but we do exactly the same thing: copying bits into the output as long as they're common between a and b, and as soon as we see a difference, setting the bit and returning. The downside here is that this is a really slow implementation. Even if we translated it straightforwardly to C, it would still be really slow compared to optimized implementations. Optimized implementations of this are potentially more complicated and potentially harder to write, but they might execute in just a handful of clock cycles, as opposed to maybe a hundred clock cycles for the unoptimized C version and probably thousands of clock cycles for the Python. Of course, performance isn't going to be our concern here; I'm using this to illustrate an issue that comes up in extremely optimized data structures. The fact that we're trying so hard to optimize these codes means we're going to have some concerns about whether they're correct or not, so let's talk about how we're going to deal with that. What we're going to do is write a random tester. The reason is that with 64-bit inputs, the input domain for either of these functions contains 2^128 elements, far too big a space to test exhaustively, so we're going to be forced to do some sort of either systematic testing or random testing; there's no way out of partial testing of these functions. The most obvious kind of random tester we could write would be: generate a random 64-bit integer a and another one b, then assert that our super-optimized high_common_bits function, called with a and b, returns the same result as the reference implementation, and likewise for low_common_bits. The reference implementation is this sort of simple, incredibly obvious code; maybe the optimized implementation is in C, it doesn't really matter. We're just going to randomly test all sorts of input configurations for these functions. The question to ask is: is this a good random tester? What I'm going to do is run the code coverage tool and find out.
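Reconstructed from the description above, the slow reference implementations might look like this. This is my sketch, assuming 64-bit inputs; the behavior when a equals b is left as whatever the loop's fall-through produces (copying all of a's bits):

```python
def high_common_bits(a, b):
    """Scan from bit 63 down to bit 0, copying bits that a and b share;
    at the first position where they differ, set that bit and return,
    leaving all lower bits zero."""
    out = 0
    for i in range(63, -1, -1):
        bit = 1 << i
        if (a & bit) == (b & bit):
            out |= a & bit        # copy the common bit to the output
        else:
            return out | bit      # first difference: set it and stop
    return out                    # a == b: falls through the loop

def low_common_bits(a, b):
    """Same idea, but scanning from bit 0 up to bit 63."""
    out = 0
    for i in range(64):
        bit = 1 << i
        if (a & bit) == (b & bit):
            out |= a & bit
        else:
            return out | bit
    return out
```

For example, high_common_bits(0b1001, 0b1010) copies the shared top bits, hits the first difference at bit 1, and returns 0b1010.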
OK, so here's the tester. It runs the Python implementations, and since all I have are the reference implementations, I'm not actually testing two implementations against each other; rather, all I'm doing is making up two random 64-bit integers and running high_common_bits and low_common_bits on them in order to gather code coverage, not actually checking the output for correctness. And I'm running a hundred thousand tests. So let's run this under the coverage tool, asking it for branch coverage. Here are the coverage results it computed. What we see is about 23 statements, with a couple of statements missing, and a couple of partial-branch results. What partial means here is that we've executed the body of the loop and we've returned out of the early path, but we failed to take the exit branch from the loop and we failed to execute the final return statement. The same thing has happened in the other function. So why did completely random testing with valid inputs, a hundred thousand times, fail to cover this? The reason is pretty easy to see. We're generating two totally independent random 64-bit numbers, and the only way we can reach this case is if they're the same. The odds of two randomly generated 64-bit integers being equal are extremely low compared to the number of test cases we're running, so we'll never test this case. And when we're testing optimized implementations of these functions, that case is exactly the kind of thing we care about. Real optimized implementations of these functions typically use specialized instructions that modern architectures provide for things like bit counting. If we use, for instance, GCC as our compiler, there are built-in functions for this: __builtin_clzll and __builtin_ctzll. __builtin_clzll returns the number of leading zero bits, that is, the number of zeros starting at the most significant bit position. We can use this to implement one of our bit functions: we XOR the two operands to turn the common high-order bits of our arguments into leading zero bits, and then __builtin_clzll gives us an extremely fast version of the high_common_bits function. Similarly, __builtin_ctzll can be used to build an extremely fast low_common_bits. But if you look at the GCC documentation, there's a catch: if the argument is zero, that is to say if the two inputs are equal so that their XOR is zero, the result is undefined. So the implementation is allowed to do something really weird in the case where the two operands we pass in are the same. And as we just saw, naive random testing has only a negligible chance of actually generating two arguments that are the same, so it's really a bad random tester for this case. Let's go back and try to do better.
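To make the XOR trick concrete, here's a Python analog of those optimized versions, with int.bit_length standing in for the GCC builtins. This is my sketch, not the actual C code from the lecture, and like the builtins it treats equal inputs as undefined:

```python
def high_common_bits_fast(a, b):
    """a ^ b turns the common high-order bits into leading zeros, so the
    highest set bit of a ^ b marks the first difference (the C version
    would find it with __builtin_clzll). Undefined when a == b."""
    x = a ^ b
    assert x != 0, "undefined when the inputs are equal"
    p = x.bit_length() - 1             # position of the first differing bit
    return ((a >> (p + 1)) << (p + 1)) | (1 << p)

def low_common_bits_fast(a, b):
    """Counterpart using the lowest set bit of a ^ b (C: __builtin_ctzll)."""
    x = a ^ b
    assert x != 0, "undefined when the inputs are equal"
    p = (x & -x).bit_length() - 1      # position of the lowest set bit
    return (a & ((1 << p) - 1)) | (1 << p)
```

These run in a constant number of operations instead of looping over all 64 bits, which is the whole point of the optimized versions.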
A better random tester for these functions would start off by making a completely random a and then make b by flipping a random number of bits of a. That is, we pick a random number between zero and sixty-three and flip that many randomly chosen bits. Let's look at how to do that. We still make a totally random a. What we change is the generation of b (and since this slows the tester down, I've also changed the total number of tests from a hundred thousand to ten thousand). a is still a completely randomly generated 64-bit number, but to get b, we pick a random count between zero and sixty-three and, that many times, use Python's XOR operator to flip one randomly chosen bit. Our new test input b is this changed version of the original a, with the idea that it's going to make a better test case. Let's see if that is indeed the case. This time we've completely covered both functions: all of the statements, all of the branches, and no partial executions.
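A sketch of that mutation-based generator (my reconstruction; the helper names are made up). Note how often the correlated inputs produce a == b, the corner case that independent random 64-bit values essentially never hit:

```python
import random

def mutate(a, rng):
    """Make b from a by flipping a random number (0..63) of randomly
    chosen bit positions using XOR, per the scheme described above."""
    b = a
    for _ in range(rng.randint(0, 63)):
        b ^= 1 << rng.randrange(64)
    return b

def count_equal_pairs(num_tests=10_000, seed=1):
    """How often does the correlated generator produce a == b?
    (When the flip count is zero, or flips cancel out, b equals a.)"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        a = rng.getrandbits(64)
        if mutate(a, rng) == a:
            hits += 1
    return hits
```

With a 1-in-64 chance of flipping zero bits, the equal-inputs case shows up on the order of a hundred times in ten thousand tests, which is why this tester reaches the final return statements that the fully independent generator missed.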
Due to the structure of these functions, we only execute farther into their loops when leading or trailing bits of the two arguments are the same. If every bit is independently random, then the probability of executing farther into these functions drops off exponentially, and with sixty-four bits the probability drops off far too fast for us to ever execute the case where the arguments are the same. Based on that knowledge — based on some domain-specific knowledge embedded in the random tester — by flipping a random number of bits we get much more thorough exploration of the iteration spaces of those loops, including reaching the ending state which, according to the GCC documentation, is undefined. So that was a pretty elaborate exercise for what are, in the end, two pretty simple functions. What this shows is that you can do okay with a really simple random tester without understanding what's going on, but to do a better job we really need to think about what it is we're testing, how it's structured, and how we're going to execute all the way through it. This is, in a sense, a fundamental limitation of random testing.
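The exponential falloff mentioned above is easy to make precise: two independent uniform random bits agree with probability 1/2, so the top k bits of two independent random words all agree with probability (1/2)^k. A tiny sketch (the function name is mine, for illustration):

```python
from fractions import Fraction

def p_top_bits_agree(k):
    """Probability that two independent uniform random words agree
    in all of their top k bits: each bit position agrees with
    probability 1/2, independently, so the answer is (1/2)**k."""
    return Fraction(1, 2) ** k
```

For k = 64 this is about 5.4e-20, which is why a naive uniform random tester essentially never generates two equal arguments.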
Okay, now I'd like to go back to the drawing of the software under test — I feel like I've drawn it about three times already. Recall that the software under test provides APIs that our test harness calls, and on the other hand the software under test itself uses APIs. Recall that we can fuzz at those levels as well: fuzzing the APIs that the software under test uses is most often called fault injection, and random fault injection, as we discussed, is extremely powerful. We've seen examples of both of these, and both of them are incredibly useful. What I'd like to talk about now is fuzzing the implicit inputs we talked about in unit two, where we have non-API inputs to the software under test that affect its behavior. It turns out that this is often pretty important. So how would we do this? Well, as I mentioned earlier, possibly the most important implicit input is the timing with which different threads run on different processors. The thread scheduler provides a very important form of implicit input to multithreaded software under test, and so it's extremely important how the schedule is determined. As far as I know there isn't a single best way to do this, and so we find in practice that people do a lot of different things. One thing you can do is generate load on the machine — that is to say, make sure that your testing is not happening on a completely idle machine. It's often the case that if you test software on a completely idle single-core machine, even if it has multiple threads, those threads will be scheduled in quite a deterministic way. Similarly, if you have a multicore machine with more processors than your application has threads, there's again a strong possibility that the threads will be scheduled in an extremely deterministic fashion while you're doing testing. If threads are being scheduled extremely deterministically, that means the scheduler is exploring only a very tiny fraction of the full set of possible thread schedules, and that means concurrency bugs are going to go unfound during testing.
So generating load by running other applications is important. Generating network activity on the machine can also be valuable. Network activity is interesting because incoming network packets cause interrupts to fire, cause interrupt-handling code to run, and cause cache lines to be stolen away from the application — all of which perturbs the schedule. There also exist specialized tools for stress testing multithreaded applications, and if one of those is available to you, it's an extremely good idea to use it to help test your multithreaded code. Okay, so what we've covered so far is what we might call external perturbations to the thread schedule. It's also possible for software to induce perturbations internally, and the way to do that is to do things like inserting random sleeps before and after acquiring locks, and also around accesses to shared variables. This incurs some cost and isn't so easy, but on the other hand it might be quite useful. A quick story: one time I talked with a group of researchers who were working on formal-methods-based tools for finding bugs in threaded code, and these researchers were having some sort of a competition where each of them tried to find the most bugs in the software under test with their tool. What happened was somebody showed up, just inserted a bunch of sleeps inside the application, and showed that they could find just as many bugs as these super-heavyweight, super-complicated formal methods tools. The final thing you can try is what I'm going to call "unfriendly emulators" — and I put that in quotes because I think it's just my pet name for these. The idea is you've got a special virtual machine or other runtime specially designed to stress test your application by doing things like invalidating cache lines, forcing thread switches at awkward times, and other such things. I'm not sure this kind of thing is generally available, but if one of them fits the needs of whatever software you're testing, you should definitely use it.
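The internal-perturbation idea — random sleeps around lock acquisition — can be sketched as a small lock wrapper. This is a minimal sketch of the technique, not code from the lecture; the class name and jitter bound are my own choices:

```python
import random
import threading
import time

class JitteredLock:
    """A lock wrapper that sleeps for a tiny random interval before
    acquiring and after releasing, perturbing the thread schedule so
    that testing explores more interleavings."""

    def __init__(self, max_jitter=0.001):
        self._lock = threading.Lock()
        self._max_jitter = max_jitter

    def __enter__(self):
        time.sleep(random.uniform(0, self._max_jitter))  # pre-acquire jitter
        self._lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        time.sleep(random.uniform(0, self._max_jitter))  # post-release jitter
        return False
```

You would substitute this for an ordinary lock only in stress-test builds — the sleeps cost throughput, which is exactly the "high cost" tradeoff mentioned above.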
The next issue we're going to explore is basically purely my own views and opinions. The question is: can random testing inspire confidence? As I said earlier in this course, the purpose of testing is to maximize the number of bugs found for the amount of effort spent testing — that's really what testing is all about. On the other hand, once we start to do a better and better job of testing, especially using random testing, we get tempted to use those results as justification to believe we're creating good software. What I want to talk about now is whether that's ever justified — is that an inference we can make? I think the answer is a sort of highly qualified yes, and I can give the qualifications behind that yes. First, we understand the software really well and it's a relatively small piece of code — for example, we're doing unit testing of a data structure like a balanced tree. Second, we have a strong and well-chosen collection of assertions embedded in that code. Third, we have a mature, well-tuned random tester. And fourth, we've measured coverage and shown that the coverage is good. In some cases where all of these hold, I think we can conclude the software is pretty good. What I basically mean by this is that I, or any of you, could take the kind of data structure we looked at in earlier lectures, do all of these things, and end up with a reasonably high degree of confidence that the code is correct. If any of these conditions fails to hold, then I would strongly doubt that random testing can inspire confidence. So, for example, if we don't have small code — if instead we have something like Adobe Acrobat Reader — then even
if all of these other things were true, there's no way that random testing would inspire confidence in the quality of the product. And so the thing we should keep in mind is that, since these conditions for inspiring confidence are quite restrictive, we never just do random testing — we always use it to augment other testing methods that might be better at shoring up some of the weaknesses of random testing.
Let's look at some tradeoffs in random testing — and what I really mean here are advantages and disadvantages of random testing that might lead you to spend more or less time doing it. On one side are the advantages, and over here are what I think are the main disadvantages. The first disadvantage is something we've spent a fair amount of time on already: input validity and the oracle problem can be really hard. A lot of people give up on random testing — in my opinion, give up on it too early, before they've gotten good results — and often it's simple validity problems that stop them. The fact is, some creativity may be required to get around these, for example by mutating existing inputs, or by getting a good enough understanding of the structure of the input that we can actually generate valid random inputs from scratch. Oracles are hard too, and it takes the same sort of creativity to deal with the lack of oracles. The next disadvantage is that people often like, when doing testing, to have a fixed test suite, and they like to know when testing is done: if it didn't produce any failures, they can feel good. Random testing, on the other hand, has no stopping criterion. For example, if you run a random tester on a core for a day without running into problems, that implies nothing about what you'll find in the future — you might find serious problems in the next hour of testing. In the compiler testing that I've been doing for several years now, I've found this to really be the case: testing a particular version of GCC, for example, the tester won't find anything for a long time, and then, due to the probabilities working out how they do, a bunch of problems come out one after another. The next problem — one we haven't really talked about yet — is that random testing may find unimportant bugs. What I mean by this is that, since random testing explores parts of the input space that may not often be explored by real users of the system,
the people developing the software under test may not be remotely interested in fixing the bugs that are triggered by the random tester, and this can be frustrating. This is really pretty situational. Some kinds of bugs, such as buffer overflows in a web server or a PDF reader or something like that, are almost mandatory to fix, because they're almost certainly exploitable — those kinds of bugs really can't be labeled unimportant by reasonable developers. On the other hand, consider problems that really might not come up in practice. Let's say that in my C compiler testing the generator produces a C function that's a hundred thousand lines long, and it fails to compile. Well, the compiler developers are almost certainly not going to be interested, because people just don't write code like that. And similarly, if I make an identifier or a variable name that's absurdly long and it crashes the compiler, they may not care about that either. The general flavor of the solution to this problem is, first of all, to generate random inputs that are more like inputs of interest. So, for example, when we're generating random C code, we don't make really long functions and we don't make really long identifier names. We could easily do these things, and they could easily be used to find bugs, but we just don't have any confidence that anybody would care about fixing those bugs — so we don't bother to find them.
The second part of the answer to this problem is that, as the person running the random tester, you do a good job filtering the bugs that you find, so that you only pass the ones that look important — the ones that look like they might be critical or security problems — on to the developers. If you find a bunch of bugs that the developers never see, that isn't necessarily a problem: it uses your time, but you've learned interesting things about the software under test, and if you make the decision not to pass those bugs on, that's probably okay. The next disadvantage is that it's not uncommon for a simple random tester to spend all of its time probing the parser. For example, if we do a bad job generating HTML, then maybe all of our test inputs get rejected by the HTML parser, and we never find anything interesting. There are two responses to this kind of thing. One response is to simply put more resources into the random testing: we can use more cores, we can use machines with more memory, or we can make random testing faster by reducing the size of the inputs we're generating. The other response, as we've discussed several times, is to make the random test case generator smarter — that is to say, to use our human expertise, our knowledge of the domain of interest, to generate better test cases. The next disadvantage is that a random tester may find the same bug many, many times. If we can automatically reject repeated instances of a bug, this is no problem at all, but if we don't have an automated way to do triage of the issues that we've found, things can really wear somebody down. This triage issue is something I'll talk about explicitly a little bit later.
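One simple way to automate that rejection of repeated bug instances can be sketched as follows. This is an illustration, not the lecture's method: keying on the first line of the failure output (say, an assertion message) is my simplifying assumption, and real triage is usually smarter — for example, hashing a symbolized stack trace:

```python
class BugFilter:
    """Report each distinct failure only once, keyed by a crude
    signature derived from the failure output."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def signature(failure_output):
        """Crude signature: the first non-blank line of the output."""
        lines = failure_output.strip().splitlines()
        return lines[0] if lines else ""

    def is_new(self, failure_output):
        sig = self.signature(failure_output)
        if sig in self._seen:
            return False          # repeated instance of a known bug
        self._seen.add(sig)
        return True               # first time we've seen this signature
```

In a test loop you would call `is_new()` on each failure's output and only save (or report) the test case when it returns true, so a bug that fires ten thousand times produces one report instead of ten thousand.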
The next problem is that it can be very hard to do debugging when the test case is very large and/or nonsensical. What I mean here is that a random test case that triggers a failure in some software under test may be many megabytes in size and have no usable human interpretation, and in those cases debugging the software under test can be really hard. There is a pretty good solution to this problem that we're going to talk about a little bit later. My final disadvantage of random testing is that, in people's experience, every time you write a new fuzzer it finds different bugs. What this basically means is that little incidental design decisions in a random test case generator explore subtly different parts of the system under test's input space, and that results in finding different bugs. This shouldn't be surprising, given the extreme difficulty of the testing problem, but it's still a little bit demoralizing when you spend a lot of time running a fuzzer, it finds a lot of bugs, you think the system is getting pretty robust — and then somebody else writes a really simple fuzzer that finds bugs yours couldn't find. This is very common; it has actually happened. So that's a long list of drawbacks, and the drawbacks are real. Let's go back and look at the other side of the tradeoffs. On the advantage side, random testing has less tester bias, meaning that the testing is less influenced by what I know about the system's implementation. Random testing also has, as we talked about earlier, weaker hypotheses about where bugs are — maybe this is just another way to say the same thing. Besides this issue of bias and hypotheses, perhaps the real killer advantage of random testing is that once we implement a random tester, we've automated the testing process: the human cost of random testing is basically zero. This has certainly been the case with the C compiler fuzzing effort I've mentioned several times, where we spent a
long time developing this random tester — we've worked really hard on it — but since we more or less finished it up a couple of years ago, it's been finding on average about a bug or two a week in real compilers with almost no effort from us. So there's a machine sitting in a dark room somewhere doing random testing, but it runs by itself and needs almost no oversight; all we have to do is look at the results and write up some bug reports. The process of turning random tests into bug reports is something I'll come back to a little bit later. Another extremely interesting and useful thing about random testers is that they often surprise us. I've had the personal experience of being surprised by random testers many times, and also, in the course of reporting the roughly four hundred fifty compiler bugs we've found, the compiler developers have often been really quite surprised by the behavior of their tools. Random testing consistently has this ability to tell us something we didn't know, and that's extremely valuable. Another advantage is that every fuzzer finds different bugs. Of course, I listed that under the disadvantages as well, and it is sort of depressing on some level, but on the other hand it's really cool: if we have several fuzzers — for example, ones written by different people in a course like this — there may not even be that much overlap in the bugs found by the different testers. Finally, due perhaps to some combination of these other advantages, I at least find random testing to be really fun. It's fun to have testing automated, and it's fun to be surprised when shortcomings in the system, or in our own logic, are pointed out by a process that doesn't have a human involved. It's really just interesting to have this kind of thing happen.