Okay, we've finally gotten to my favorite testing topic, which is random testing. Random testing just means that test cases are created, at least in part, using input from a random number generator. This became my favorite testing method a few years ago, when I noticed that, without even realizing it, I'd written a random tester for every piece of software I ever wrote where I actually cared about its correctness. So I've written at least a dozen random testers.
So let's look at how this works. We have a random test case generator. The test case generator takes as input pseudorandom numbers; PRNG here stands for pseudorandom number generator. Of course, on computers we almost never get real random number generators. What we get instead are pseudorandom numbers, which are algorithmically generated, but for purposes of creating test cases they're good enough. The nice thing about pseudorandom numbers is that they're repeatable: when we start using a pseudorandom number generator, we give it a seed, which completely determines the sequence of random numbers it's going to generate. So if we ever want to generate the same random numbers again, all we have to do is remember the seed. This is nice because we don't have to save voluminous random test cases; we can just remember what seed we started with if we ever want to get back a particular set of random tests.

The other thing that goes into a random test case generator, usually, to make a good one, is a significant amount of domain knowledge. By domain knowledge, I just mean that you have to understand some properties of the software under test. We'll talk in quite a bit of detail about what form this domain knowledge might take and how you encode it as part of the random test case generator.

Generated test cases come out of the random test case generator and go into the software under test. The software under test executes and produces some output. The output is inspected by a test oracle. The oracle, as we've already learned, makes a determination whether the output of the software under test is good or bad. If the output is good, that is to say, if it passes whatever checks we have, we just go back and do it again. On the other hand, if the output is not okay, we save the test case somewhere for later inspection and go back and do more random testing.
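The loop just described can be sketched in a few lines of Python. Everything here is a stand-in of my own choosing: the generator makes random lists of integers, the software under test is a sort routine, and the oracle checks that the output is a sorted permutation of the input. Note how a single seed makes an entire run reproducible.

```python
import random

def generate_test_case(rng):
    # Hypothetical generator: a random list of integers for the SUT.
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]

def software_under_test(xs):
    # Stand-in SUT: a sort routine we want to check.
    return sorted(xs)

def oracle(inp, out):
    # Output must be the input's elements, in nondecreasing order.
    return sorted(inp) == out and all(a <= b for a, b in zip(out, out[1:]))

def run_random_tests(seed, n):
    rng = random.Random(seed)   # the seed fully determines every test case
    failures = []
    for _ in range(n):
        case = generate_test_case(rng)
        if not oracle(case, software_under_test(case)):
            failures.append((seed, case))  # save for later inspection
    return failures
```

Because the generator is driven entirely by the seeded PRNG, rerunning with the same seed regenerates exactly the same test cases, so only the seed needs to be saved.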
The key to making this all work is to wrap the entire random testing tool chain in some sort of driver script which runs it automatically. What we do is start the random tester on some otherwise unused machine, and then we go do other things, or we go home. While we're doing other things, the random testing loop executes hundreds, thousands, or millions of times. The next time we feel like seeing what's going on, maybe coming in to work in the morning, we basically just look at what kind of test cases have been saved. If anything interesting turned up, we have some follow-up work to do, like creating a reportable test case and debugging. If nothing interesting happened, then that's good: we didn't introduce any new bugs, and we can rebuild the latest version of the software under test and start the testing loop again. Generally, random testing is not necessarily part of the test suite for the software under test; rather, it's a separate testing loop that gets run independently, acting as an external quality assurance mechanism. If the random test case generator is well done, if we give a sufficient amount of CPU resources to the testing loop, and if it's not finding any problems, random testing can significantly increase our confidence that the software under test is working as intended.
It turns out that, in general, there are only a couple of things that are hard about making this work. First, it can be tricky to come up with a good random test case generator, and second, it can be tricky to come up with a good oracle. Of course, we've already said that these are the hard things about testing in general: making test cases and determining whether outputs are correct. Basically, the same thing is the case here, but the character of the problems we run into while doing random testing is a little bit different, and that's what we're going to spend the next several lessons of this course looking at. What I'd like to do now is go over a couple of real examples. One of them involves a very large random tester testing quite complicated pieces of software. The other is almost trivial: a tiny, almost one-line random tester testing a small function that I wrote. So let's take a look.
I'd like to show you a little bit about how a tool my research group created, called Csmith, works. Csmith is just a random test case generator. We use it to try to break C compilers. Let's take a look and see how it works.
What I'm going to do here is shell into a machine belonging to my group, and we're going to look at an in-progress Csmith run that's been going for a couple of days. Let's see what happened. Okay, so here's what happened over the last couple of days with this testing run. What I'm doing is generating a program using Csmith and then using the latest version of GCC and the latest version of clang (that is, LLVM's C front end) to compile each test case at a variety of optimization levels. I'm not using any kind of weird optimization levels; I'm basically just compiling at -O0, -O1, and -O2. It doesn't matter if you're not a C compiler user: these are basically the vanilla compiler optimization levels that any compiler user would probably be using. What the bottom of this output shows us is that 150,000 tests have run since I started this testing run, and during this run GCC has failed a half dozen times or so, clang has failed a few times, and we also see a few Csmith failures.
It could be that the Csmith failures are actual bugs, but historically speaking, most of the time these are timeouts. Generally, all of these tools are run under timeouts when we use a random testing loop, because random tests tend to be really good at finding performance pathologies where a tool runs for a really long time. That wastes time, so what we do is just kill any tool after it has run for several minutes, including our random test case generator itself. So let's see what we can learn here. What I'm going to do is look and see if we found any interesting LLVM failures. When LLVM crashes, it always tells us how it crashed using an assertion violation.
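This is not Csmith's actual driver code, but the timeout discipline is easy to sketch in Python using the standard subprocess module: every tool invocation gets a time budget, and anything that exceeds it is killed and treated as a non-result rather than a bug.

```python
import subprocess

def run_with_timeout(cmd, seconds):
    """Run a command under a time budget.

    Returns (returncode, stdout, stderr) if the command finishes, or
    None if it was killed for exceeding the budget, so the testing
    loop can log a timeout and move on to the next random test.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=seconds)
        return result.returncode, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        return None  # performance pathology: the tool ran too long
```

A driver script would call this for the generator and each compiler invocation, counting None results separately from crashes.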
As I discussed a couple of lessons ago, many real programs contain a lot of assertions, and these compilers certainly have that property. Okay. What we can see is that LLVM has crashed, it looks like, 4 times here: one time with a message about wrong topological sorting, and 3 times with a message about an unknown live-in to the entry block. It turns out that I happen to know this wrong topological sorting bug is not new; it's the one I reported a couple of weeks ago. On the other hand, this unknown live-in to the entry block bug is probably new. One thing we could do is check LLVM's Bugzilla to see whether it actually is new, but what I'm going to do now is just assume that it is, and let's take a look at this bug. So here we are in the actual log of Csmith's output, and what it's telling us is what command line arguments it used when it invoked Csmith to generate the program that killed clang. You can see it contains a bunch of options, but it also contains the random number seed that Csmith used, so we can cut and paste this and pass it to a different script, which is going to generate this program again. Let's look at it.
What does the random program that killed clang look like? Here we have a bunch of preamble stuff at the top, so let's get down to the random code. Okay, here's the random code. What you can see is that it's basically just a bunch of random junk: we're making function calls, we have a little integer constant, we have some relational operators. So how large is this program? Let's see: oh, it's not that big. It's 1,600 lines, or 37 kilobytes, of code.
But right now we don't know what about that program caused clang to crash; all we know is that it crashed it. So let's just make sure that reproduces. Here you can see that clang did indeed die on that input: we see the assertion failure, unknown live-in to the entry block, and the rest of this is just a bunch of other information that clang gives when it crashes.
For example, there's the stack trace of the functions that were executing when the compiler crashed, a detailed version of the arguments, and some other stuff. This is all material we would include in the bug report when I report this bug to the LLVM developers, which I will do tonight after I finish recording lectures.
Now let's see if we found anything interesting about GCC. We saw that LLVM includes an assertion string when it crashes; GCC includes a different message: an internal compiler error. What we can see here is that several times GCC responded with a message saying that it was killed, and these do not represent compiler bugs that we're interested in. They're just performance pathologies where the compiler ran for quite a while on some random input, and those tend to be hard to report as bugs, so we're not going to worry about them. We can also see that there was one internal compiler error where verify_ssa failed.
I believe that's a known bug, so we're not going to worry about it. There's a segmentation fault where GCC crashed; most often that's due to dereferencing a null pointer, and we're not going to worry about that either. There's an error in get_loop_body that happens to be a known bug, and that bug has already been reported. Then we have an error here which looks to me to be new: an internal compiler error in get_initial_def_for_induction, in a tree-vect file. Tree-vect means the error was in the loop vectorizer, a module of GCC that has been getting quite a bit of development and quite a bit of attention in recent years, so this is the kind of place where we'd expect to find bugs. I'm not going to generate the program that caused this crash, since we already saw that example for clang, but basically what we've seen here is how a random testing loop works.
If we look at the activity on this machine, we can see that lots of processes are running: evaluate_program, one of the driver scripts of the random tester; Csmith, of course, the random test case generator; and cc1 and clang, which are compiler processes. These compilers are sitting here using all 8 cores of a rather fast modern machine in order to break compilers.
Going back to the diagram: what we just saw was exactly an example of the general outline that I showed you. Csmith was the random test case generator. It was given seeds by a driver script, the driver script ran the tools in a loop, and the oracle in this case was simply looking for compiler crashes, so we weren't even running the compiler output. Of course, when the compiler runs, it generates code; we weren't running that code or even looking at it. All we were doing was waiting for the compiler to tell us that it failed, and the reason the compiler could tell us that it failed is that the compiler developers have conveniently included lots of assertions.
If those compiler developers hadn't included all of those assertions, then these compilers wouldn't have failed in such an obvious way, but they still would have encountered bugs, and they probably would have failed in different ways later. They might even have failed in the worst possible way, which is where the compiler generates code and that code actually executes improperly at runtime. That's bad because we, as compiler consumers, don't want the compiler giving us wrong code.
So while those assertions can be annoying for us as compiler users, because the compiler crashes, they're actually very good and we want them to be there. The driver scripts, then, are checking the output, saving the test cases (or, in this case, the seed) in the log file, and going back to do it again.
If you remember, in about 2 days of testing time, this loop executed about 150,000 times on a fast 8-core machine. Of course, if we were testing much simpler systems than compilers, we might have executed a much larger number of tests than that, and if we were testing much slower software under test, we might have executed a much smaller number of iterations.
The example we just saw, using Csmith to test C compilers, represents several years of work by my group; it's really a rather large effort for us. What I want to show now is a different example of random testing: we're going to test a tiny UNIX utility function using a very tiny random tester. If you remember, a couple of lessons ago we talked about the UNIX read system call, so let's look at how the read system call works.
Before I get going on this example, I should apologize a bit for dropping into C here. I tried to work this example into Python and it really just didn't come out right; it's sort of inherently a C example. It's not going to matter if you don't know C or don't like C: I'll go through the code, and really it's rather simple. The UNIX read system call, as called from a C program, takes three arguments: fd, which is a file descriptor; buf, which is a pointer to a block of memory into which data is going to be read; and the number of bytes to read from the file represented by the file descriptor into the memory block pointed to by buf.
Usually, what the read system call does is read the number of bytes you expected into the memory block you provided, and all is good. The return value of read is the number of bytes read; that's what usually happens. But there are a few other things that can happen. A second possibility is that the read system call returns 0, indicating that you've reached the end of the file. A third is that it returns -1 if it failed. But the fourth possibility, and this is the one that is sort of pernicious and that a lot of programmers get wrong, is that read can return a number of bytes less than the number you asked for. This isn't a failure; it doesn't represent any sort of out-of-memory condition or end of file or anything like that. It just means we need to try again.
And so the little UNIX utility function that I'm going to test here is a different version of read. It's a version of read called read_all that acts just like read except that it's not going to have this property of returning partial reads ever. So what read always is going to do is in the case of a partial read is going to just issue another read call that picks up where the first call left off and it's going to repeatedly do that until the read is complete or until some sort of an error or end of file condition occurs. So if anything bad happens, it will just return a -1 value which is the sort of standard UNIX error value. If the read calls are succeeding but returning partial results, we're just going to keep issuing read calls until we're finished, and so let's look at the code that does this.
Here's my read_all function. This is code I took a few minutes to write, and you can see that I've included a few assertions.
The first thing I've asserted is that the file descriptor passed in is greater than or equal to 0. File descriptors are integers, usually small ones, and they have to be non-negative.
The second thing we assert is that buf, the pointer to the block of memory into which we're going to read data, isn't the null pointer; just asserting buf like this is implicitly a test that the pointer is not null in C.
The third thing we assert is that the number of bytes to read is also non-negative. Given that setup, and having done a little sanity checking, we can get to the main logic. The first thing we do is initialize a left variable, the number of bytes left to read, to the total number of bytes to read.
Now we enter our while loop. The loop operates until the read system call returns something less than 1, that is to say, until it returns 0 indicating an end-of-file condition or -1 indicating an error; if either of those happens, we just return that value to our caller. If read works, that is, it read some positive number of bytes, we increment the buffer pointer by that number of bytes, reduce the number of bytes left by the same amount, and assert that the number of bytes left hasn't gone negative due to some sort of logic error. Finally, if there are no bytes left, we return the number of bytes read, which has to be just the original number of bytes the user asked to read.
This is pretty simple code, and I would expect, or at least hope, to be able to get it right, but I know from vast prior experience that I never get this stuff right the first time. So what I want is to be able to test it.
What I've done is written a little test harness. The test harness opens a file (here we're just using a splay tree implementation in Python, splay.py, that happens to be in this directory) and allocates a buffer; we don't need to worry about the details here.
Then, 100 different times, we issue the read_all call to read the contents of the file into the buffer, and we assert that read_all always read the full number of bytes we were trying to read. We also assert, using a memory comparison function, that the contents of the buffer are the contents we expected; the expected data lives in a variable, initialized by some code I didn't really talk about, that contains the data from the file. So if we do this 100 times and none of the assertions fail, our new read_all call passes. Let's look at what happens.
Okay, so 100 test cases passed in the tiny amount of time it took to run. One thing we might conclude, since 100 tests passed over read_all, is that the logic is solid, but it turns out this conclusion isn't really warranted. Let's see if we can get the program to tell us why. What I'm going to do here is print out the number of bytes actually read by the raw read system call and then run our testing loop. What happens is that every single time we called read, it read the full size of the file.
Our splay.py here, we can see, is 3,121 bytes, and the actual UNIX read system call read that many bytes every time we called it. So it ends up being the case that, even though it's legal for read to return fewer bytes than we asked for, it isn't actually doing so. To test our logic in any meaningful way, we're going to need to make it do that, and the question is how. One way would be to go hack Mac OS so that read sometimes doesn't return all the bytes we asked for. That's not very fun, so we're going to do something a little easier.
We're going to insert faults ourselves by calling a read function with fault injection, a function I already wrote here. This read with fault injection has exactly the same interface as read, just a different name, but instead of reading the nbytes we asked for, it sets the number of bytes to read to a random number between 1 and nbytes inclusive. The +1 in the code means that instead of getting a value from 0 to nbytes - 1, we get one from 1 to nbytes, and that's what we want: we don't want to read 0 bytes, because a return value of 0 indicates an end-of-file condition. So now we're ready to run some tests.
We can see that our fault-injecting version of read is reading fewer bytes than we asked for. The first time it's called, it picks a number between 1 and 3,121, so it usually picks something in the middle, but then successive read calls have a narrower and narrower range of random numbers they can return, and that's what explains this progression toward smaller numbers. There should be 100 of these sequences, since we let the test run 100 times. So the question we can ask now is: is our confidence in the software higher? Well, there are probably other tests we should do. One thing we could do is use a fault-injecting read function that, instead of being random, reads one byte every time; that might be a reasonable stress test. Another thing we might do is simulate the random end-of-file conditions and random errors that read is allowed to return. During our testing, the read system call never returned an error value; it always read the file successfully. If we want to make it do that, we have to modify our program a little bit, and I'm not going to do that right now. In the meantime, I think we've established that the logic here is pretty sound; it can deal with a wide variety of conditions. One thing we might want to do just before leaving is run it a million times instead of 100 times.
Okay, so we run a million iterations; let's see how long this takes. Hopefully not too long. Okay, it took 15 or 20 seconds. So the question now is: do we have reasonable confidence that our logic is solid? Except for the error-handling condition, which we didn't exercise, I would say this is pretty solid code now.
Going back to our overall master diagram: what we have is a driver script, which in this case was just the C program. We have a random test case generator, which was basically two lines of code: one computed a random number of bytes to read, and the other actually called the read function. The software under test was the other function, the read_all function that we wrote.
The oracle was implemented by the memory comparison function; that is to say, did the read_all call actually read the right bytes from the file? As we saw, the oracle never detected an error, since we got the function right and everything worked out. This business of implementing a read_all call, a stand-in for read that behaves better, has been on my mind lately because I assigned it as a problem on the final exam for my operating systems class, which my students finished a couple of weeks ago, and it turned out that many of the students didn't implement it correctly.
A lot of the time they forgot, in the read_all function, for example, to increment the buffer pointer, so that subsequent calls to read would overwrite the data from the original call. I saw so many of those that I decided to write the function myself and see if it was hard. It was a little bit tricky, but it worked out, and that's why I thought it might make a good example for you all.
Hopefully what you learned from the previous example is that building a random test case generator doesn't have to be very difficult. Realistically, though, it's usually a bit more involved than the one we just saw, and the key problem is generating inputs that are valid; another way to say that is generating inputs that are part of the input domain of the software under test. What I want to do here is look at the entire space of random inputs.
By the entire space of random inputs, I mean that we haven't ruled anything out: what we get is random 1s and 0s. Remember that we're constructing random inputs, feeding them into some software under test, and looking for outputs in the range of the software under test. Let's say, for the sake of argument, that we're testing a web browser, and specifically the rendering-engine part of a web browser: we're testing what happens when data comes over the web to the browser in response to an HTTP request and the browser wants to render it. The question we have to ask ourselves is how much of the space of totally random 1s and 0s constitutes a valid input for the web browser.
As you might expect, almost all arbitrary combinations of 1s and 0s fail to be valid web pages, so there's going to be some small subset of the set of random inputs that constitutes valid inputs to the web browser. If we take one of the other inputs and hand it to the web browser, it's going to be mapped to a part of the output space that corresponds to malformed HTML.
One thing we might ask is: isn't it a good thing to test web browsers with completely invalid inputs? After all, it's easily possible that some subset of this space, instead of mapping to a page error, corresponds to browser crashes. The answer is yes: we definitely do want to test our web browser with completely random inputs. On the other hand, the amount of code in the web browser that serves to classify inputs as renderable versus not renderable is some very tiny fraction of the total amount of code in the web browser.
Let's take a slightly different view of the same problem. What I want to do is draw a graph showing the level of code coverage that our random test cases induce on the system under test. Again, we're testing a web browser, and the x axis, unfortunately, is going to be a little bit fuzzy.
What I want to show on the x axis is how far into the web browser we execute. First comes code that just checks, for example, whether the incoming data is even valid HTTP. Once we get valid HTTP, the browser is going to scan and make sure it got valid HTML; it's going to be doing lexical analysis and checking of HTML. If the input is not rejected by that kind of code, it goes on to the rendering engine, and finally there might be some more advanced processing, dealing with things like forms, cookies, and scripting. Okay, so we have this graph.
Now let's see what happens when we fuzz the web browser using totally random bits. What's most likely to happen is that most of the inputs that come in are not even going to be valid HTTP responses, so we're going to get coverage that rapidly drops off, and what's left is almost always going to fail somewhere else early on. We're going to spend the bulk of our testing effort on code that rejects random sequences of bits very early, and very little of our effort testing code deeper in. Again, as I said, if that's what we're trying to do, if we really want to stress the early parts of the web browser code, then that's great, and random testing is perfectly good at that. On the other hand, if what we're interested in is broad coverage of the software under test, then we're going to fail. The red color here indicates totally random bits.
The next thing we can do is rig our random input generator so that it respects the constraints of the HTTP protocol. Furthermore, we can adapt it so that the text it generates contains valid lexical elements of HTML; that is to say, it's composed of things like angle brackets with tags in them, other kinds of text, and other directives. This isn't too hard to do. If we do something like that, I'm going to use green on the graph to represent this protocol-correct input. I'm using fuzzy terms of my own devising here; I'm not trying to use any kind of standard terminology.
Now what's going to happen is that, hopefully, we'll still get pretty good coverage of the protocol code, and we should get quite good coverage of the lexical HTML processing, but then we fall off a cliff again, because as soon as we get to the renderer, it becomes apparent that we didn't try hard enough to generate valid HTML, and we won't get something that renders very often. What we've accomplished is to push the coverage we're getting on the software under test farther into the HTML processing chain, but still not very far. So the next thing we can do is use some sort of grammar, or some sort of structural definition of HTML, to generate random but valid HTML. What's going to happen then is that coverage of the protocol code and the lexer may decrease, while on the other hand we're able to push quite deeply into the HTML processing code before falling off a cliff.
So what have we done? We've traded off coverage in the early parts of the web browser, which may well be so simple that we don't care much about them, for coverage farther in. Finally, what we could do is generate random content that includes elements of scripting, forms, and whatever else we're interested in testing, and run that through the browser. Our coverage might decrease a little bit more in the early parts, because we're spending more time doing other things, but we're probably not going to fall off a cliff at all. So you can see that in most cases, when we do random testing, what we're looking for is this kind of flat line, and what the flat line indicates is that we're covering all parts of the software under test roughly equally.
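To make the grammar idea concrete, here's a toy sketch in Python. The tag set, depth limit, and probabilities are all invented for illustration, not taken from any HTML specification, but the recursion guarantees that every open tag gets a matching close tag, which is exactly the kind of validity a purely random bit generator can't provide.

```python
import random

# Invented toy grammar: a tiny subset of HTML-like markup.
TAGS = ["b", "i", "p", "div", "span"]
WORDS = ["fuzz", "test", "random", "input"]

def gen_html(rng, depth=0):
    """Generate a random but structurally valid markup fragment."""
    # Terminal case: emit plain text at max depth, or with 40% chance.
    if depth >= 3 or rng.random() < 0.4:
        return rng.choice(WORDS)
    # Nonterminal case: wrap 1 to 3 child fragments in a random tag.
    tag = rng.choice(TAGS)
    body = "".join(gen_html(rng, depth + 1)
                   for _ in range(rng.randint(1, 3)))
    return "<%s>%s</%s>" % (tag, body, tag)   # tags always balanced
```

A real grammar-based HTML generator is far larger, but the shape is the same: choose a production at random, recurse on its nonterminals, and structural validity falls out by construction.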
What we're going to see, as we look through more random testing examples, is that getting a coverage curve like this often requires quite a lot of work and quite a lot of sensitivity to the structure of the input domain. On the other hand, we get paid back for that work with random tests that can exercise the entire software under test, and that's going to be a valuable thing in many cases.
Okay. Now that we've introduced the input validity problem, let's work through a situation that's going to lead to a programming problem for you, where you have to solve this kind of problem.
The specific problem we're going to look at is generating valid credit card numbers. You might ask, "Why would we care about generating valid credit card numbers?" The assumption here is that you're working on some software under test whose input domain is valid credit card numbers. Your software under test is probably doing something like processing payments, and its output is going to be something like completed transactions.
So how are you going to test this? Well, one thing you might do is collect some credit cards that you and your friends have, so you'll have a variety of Mastercard, American Express, and other kinds of credit cards. You run them through your payment processing system and see if it works. After you're finished doing that, the question you have to ask is: have you shaken all the bugs out of your code? The answer is probably not. So what we want to do is randomly generate valid credit card numbers and use them to test the payment processing system. To get any further, we have to take a look at the structure of a credit card number. Here's how those work.
There are some number of digits at the start of a credit card number, let's say 6, that constitute the issuer identifier. What this means is that, for example, all of the American Express cards issued by a certain company are going to share an issuer identifier. The next set of digits, let's say 8 of them, corresponds to the account number for the credit card, and finally, at the end, is a check digit. The check digit is computed algorithmically from the preceding digits, and its function is to serve as a validity check for the credit card number.
So if I'm entering my credit card into a website, or if I'm swiping my credit card and the magnetic strip has gotten corrupted, it's fairly likely that if any of these digits are wrong, the check digit is going to catch it. The credit card number can then be rejected rapidly and algorithmically, rather than actually having to submit it to somebody who can process payments and forcing them to reject it when the account number or the issuer identifier fails to look up correctly.
The check digit is computed using something called Luhn's algorithm. So this fellow Luhn was a guy who worked for IBM during the middle part of the 20th century. And in 1960, he patented Luhn's algorithm which was simply a method for calculating this check digit and it has been used ever since in applications like this. Let's look at how to use Luhn's algorithm to calculate a check digit.
So what we're going to have is a sequence of numbers, and this will work for any number. And now there are two cases. If there's an odd number of digits, we're going to do one thing. And if there's an even number, there'll be a slight variation. If there's an odd number of digits, what we're going to do is go through the digits and take every even-numbered digit and multiply it times 2.
So the odd-numbered digits we leave alone, and the even-numbered ones we multiply by 2. Next, we check whether any of the resulting numbers are greater than 9; if so, we subtract 9, yielding a single digit in each case.
Now what we do is sum up all of the digits we've derived, that is to say, the odd-numbered digits that we left alone plus the results of these calculations on the even-numbered digits. Then we compute that sum mod 10, which is to say we just take the 1s digit of the sum. If that value is equal to 0, you have a valid credit card number; if it does not come out to be 0, then it's invalid. Okay. So that was the case for an odd number of digits.
Now, let's briefly look at the case for an even number of digits, and it's almost the same. If we have an even number of digits, we go ahead and double the odd-numbered digits, and if a doubled number ends up being 10 or higher, we subtract 9. After that, the process is the same: we add everything up, take the 1s digit, or equivalently take the sum mod 10, and see if the result is 0. So that is Luhn's algorithm, and the reason we needed to go over it in a bit of painstaking detail is that if you want to generate valid credit card numbers, you're going to need to implement Luhn's algorithm in order to do that. And there's one little detail that I left off.
And so what I just told you is a way to check whether a number is valid, but what you're going to need to do is create a valid credit card number. So you're going to be given the issuer identifier, you're going to make up the account number randomly, and then you're going to put a 0 at the end of the credit card number and use Luhn's algorithm to compute the checksum for it. The result of that is unlikely to be 0. If the checksum does come out to be 0, then you've already generated a valid credit card number. If it hasn't, then you're going to have to subtract the checksum from 10 and put whatever number you arrive at in the place of the check digit where you had the 0, and now you have a valid credit card number. I suggest that as you're implementing this you check it: you compute the Luhn checksum; if it's not 0, you subtract it from 10 and put that in the check digit position; and now you can go ahead and compute the checksum of the entire credit card number. It needs to come out to 0.
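To make the procedure concrete, here is one compact way it might be sketched in Python. The function names are my own, and the single boolean test replaces the two explicit odd-length and even-length cases described above; treat this as a sketch, not as the official solution.

```python
def luhn_checksum(number):
    # Digits are numbered 1-based from the left. With an odd total
    # number of digits the even-numbered ones are doubled; with an
    # even total the odd-numbered ones are. A doubled digit greater
    # than 9 has 9 subtracted from it.
    total = 0
    for i, ch in enumerate(number, start=1):
        d = int(ch)
        if (i % 2 == 0) == (len(number) % 2 == 1):
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10

def add_check_digit(partial):
    # Append a placeholder 0, compute the checksum, and replace the 0
    # so that the whole number's checksum comes out to 0.
    check = luhn_checksum(partial + "0")
    if check != 0:
        check = 10 - check
    return partial + str(check)

print(luhn_checksum("79927398713"))   # 0, so this number is valid
```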
Here's a look at what you're going to do in a little bit more detail. Remember that what you're writing is a random tester for the credit card transaction processing system.
The core of this random tester is going to be a function called generate, which is what you're going to implement, so generate is going to take two parameters. The first one is the prefix, which is going to correspond to the issuer identifier. And this is a sequence of digits which is given to you, which has to appear at the start of the credit card number that you generate.
The second parameter is the length. That is to say, an integer parameter that determines the total length in digits of the credit card number that you're going to generate. So your job is to construct a credit card number which has the given issuer identifier or prefix, has the required total length, has completely random digits in the middle, and has a valid check digit as computed using Luhn's algorithm plus the little procedure for turning the Luhn checksum into a digit, which makes the overall credit card number's checksum come out to be 0 and therefore be valid. So let's just look at the code quickly.
So what you're writing is a function called generate. It takes the prefix and the length, and your code is going to go here. We might call it like this: we might say the prefix is 372542, which happens to be the prefix for a credit card that I have in my pocket, with a length of 15, and now you're going to create a valid 15-digit credit card number with the required properties. To do this, you're probably going to want to implement a function called luhn_checksum, which computes the checksum of a number, and the auxiliary function is_luhn_valid, which simply checks whether the Luhn checksum for a given input is 0. What I'd like you to do is implement that now.
If you get totally stuck and you'd like a hint, there's something you can do, which is look up Luhn's algorithm on Wikipedia. The reason that's kind of cheating is that the Wikipedia page for Luhn's algorithm actually gives you Python code that works. But don't go there unless you actually want to see the real answer. Okay. Take a few minutes now and write the code; we'll use it to generate a number of credit card numbers and check whether they're valid. If they're not valid, you'll get some indication of what the problem is. And if they are valid, you'll be told at the end and you can move on to the solution.
Before I let you go hack on this, I just want to give you a quick reminder about something that should save you some trouble. In the discussion of Luhn's algorithm here, I assumed that the digits were numbered starting at 1. For example, digit #1 here is an odd-numbered digit, digit #2 is an even-numbered digit, #3 is odd, etc. On the other hand, Python strings are indexed starting at 0. That's something you're going to have to watch out for, or else you're going to get a little off-by-one error in your code that makes it a pain to debug.
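As a tiny illustration of that conversion (the variable names here are just for exposition):

```python
digits = "372542"
# Python indexing is 0-based: digits[0] is "3", which is digit #1
# in the 1-based numbering used in the Luhn discussion above.
positions = [(i + 1, d) for i, d in enumerate(digits)]
print(positions[0])   # the 1-based position paired with the digit
```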
import random

def luhn_digit(n):
    n = 2 * n
    if n > 9:
        return n - 9
    else:
        return n

def luhn_checksum(n):
    l = len(n)
    sum = 0
    if l % 2 == 0:
        for i in range(l):
            if (i + 1) % 2 == 0:
                sum += int(n[i])
            else:
                sum += luhn_digit(int(n[i]))
    else:
        for i in range(l):
            if (i + 1) % 2 == 0:
                sum += luhn_digit(int(n[i]))
            else:
                sum += int(n[i])
    return sum % 10

def generate(pref, l):
    nrand = l - len(pref) - 1
    assert nrand > 0
    n = pref
    for i in range(nrand):
        n += str(random.randrange(10))
    n += "0"
    check = luhn_checksum(n)
    if check != 0:
        check = 10 - check
        n = n[:-1] + str(check)
    return n

def is_luhn_valid(n):
    return luhn_checksum(str(n)) == 0
We've still got the code that I wrote for the Luhn checksum. This is kind of redundant, bad code, but I wrote it quickly and I think it works. So we have the luhn_checksum function. The first thing it does is look at the length of its argument, that is to say, the length of the string over which it's going to compute a checksum. It also initializes the checksum variable to 0. Now, if we're computing a checksum over an even number of digits, the remainder mod 2 is going to come out to 0. So in this case, we loop over the digits and accumulate the checksum; range(l) goes from 0 to length - 1.
And here what I do is add 1 to that number so that we get the 1-based numbering system I showed you in the explanation of Luhn's algorithm. So if we're at digit 0 in Python, we're at digit number 1 in the 1-based human numbering system, and if that position comes out to be 0 when taken mod 2, we have an even-numbered digit, and in that case we add the value of that digit into the sum; otherwise, we call the luhn_digit function on the same value. The luhn_digit function is just sort of the obvious thing: it multiplies the number by 2; if the result is greater than 9, it returns 9 less than the number; otherwise, it just returns the number.
Now, the other case is that we're checksumming an odd number of digits. In that case, we have exactly the same logic with the even and odd cases flipped around. So here we take the Python 0-based digit position, add 1 to it to get the human 1-based numbering, and take that mod 2. If it comes out to be even, we do the luhn_digit computation; if it's odd, we just add in the value of the digit. Finally, we take the checksum we've accumulated and compute it mod 10, which gives us the actual checksum value that we're going to use. So now, here's the algorithm for generating a valid random credit card number.
So the inputs are a prefix and a total length in digits. So the number of random digits we're going to generate is the total length minus the length of the prefix minus 1 for the checksum digit. Now we're just going to assert that that's greater than 0 just to be careful. And now we're going to set N as our credit card number that's in progress.
So we're going to initialize that to be just the prefix. Now for the number of random digits we have, we're just going to make up a random digit and append it to the string. Finally, we're going to add a zero checksum to it and compute the Luhn checksum.
If that Luhn checksum comes out to be zero, we're done. If it comes out to be non-zero, we have to do the little inversion that I talked about, where we subtract the checksum from 10. What this little bit of logic does is strip off the last character of the credit card number, that's the zero we added, and append the check digit. So that's how easy it is to generate a valid credit card number. And then there's the is_luhn_valid function, which just checks whether the Luhn checksum is 0.
Now here, below, we have code from Wikipedia that does the same thing. I'm not going to go through the logic here, but as you can see, this is quite a bit more idiomatic Python. It's actually quite a bit nicer than the code that I wrote, so if you like it better, use it as a model instead of my code.
The code that I wrote is pretty dumb and obvious. We have, equivalently, a Luhn validity check using the Wikipedia algorithm, which just does the obvious thing. And then what I have here is a random tester, which generates a random credit card number with a certain prefix and 15 digits and then ensures that it's valid. The validity-checking function for credit card numbers simply makes sure that it has the right length, that it has the right prefix, and that the checksum comes out to be 0, that is to say, that the is_luhn_valid function returns true. So that's all there is to it. But what I want to do finally is take a look at one other issue.
I'm going to comment out my code here and comment in some different code. What we're doing now is generating completely random 15-digit numbers. What I'm doing is generating a random integer between 0 (inclusive) and the smallest 16-digit number (exclusive), so the largest number that can be generated is the largest 15-digit number. Then we convert that to a string and do a zero-fill operation, which adds leading zeros. Now we have a completely random number that is 15 digits long. If that checks out to be a valid credit card number, we increment our validity counter, and finally, after doing this 100 thousand times, we print the number of valid credit card numbers that we got.
So let's run this and see what happens. Okay. We got no valid credit card numbers out of 100 thousand. So the problem is the prefix was too long. With a 6-digit prefix, the chance is one in a million that we'll generate just this prefix and then it goes down to one in 10 million that we'll meet the prefix and the checksum requirement. So if we start off with a much smaller prefix like just 37 and this is basically anything in the American Express system I think, now let's see what happens. We're going to generate 100 thousand credit card numbers and 104 of them came out to be valid. So even with just a two-digit prefix, it's pretty unlikely that we generate valid credit card numbers.
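The experiment just described can be reconstructed along these lines. This is a sketch, with a compact checksum function repeated inline so it runs standalone; the exact count will vary from run to run.

```python
import random

def luhn_checksum(number):
    # Compact Luhn checksum: digits numbered 1-based from the left,
    # with which digits get doubled depending on the total length.
    total = 0
    for i, ch in enumerate(number, start=1):
        d = int(ch)
        if (i % 2 == 0) == (len(number) % 2 == 1):
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10

valid = 0
trials = 100_000
for _ in range(trials):
    # A completely random 15-digit string, zero-filled on the left.
    n = str(random.randrange(10 ** 15)).zfill(15)
    # Validity here means: correct 2-digit prefix AND checksum of 0.
    if n.startswith("37") and luhn_checksum(n) == 0:
        valid += 1
print(valid, "valid out of", trials)   # roughly 1 in 1,000
```

This makes the arithmetic in the lecture concrete: a 1-in-100 chance of hitting the two-digit prefix times a roughly 1-in-10 chance of a zero checksum gives about 100 valid numbers per 100,000 trials.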
And so what that means is that if we're generating lots of invalid credit card numbers, we're stressing only the very small bit of the transaction-processing logic that checks for valid credit card numbers, and we're not stressing the rest of it. So what I hope I've accomplished here is, first of all, to motivate the fact that this problem of generating valid data is a real one, and second, to give you a bit of a feel for the kind of code we usually have to write to generate valid inputs.
And so, if we go back to our web browser example, you can see that we would be doing a similar exercise, just quite a bit more sophisticated: generating, for example, valid HTML, or valid HTML with scripts and the other things it would take to actually do meaningful fuzzing of a web browser, as shown by the blue line here.
And so to do this, instead of spending half an hour or however long you spent on the API quiz, maybe you're going to be spending many weeks. But on the other hand, what we're going to get out of this, if we do it right, is a lot of high-value bugs, including security bugs, in our web browser, and it's quite possible, though of course not guaranteed, that the value of the bugs we find would make the effort worth it.
So in the two examples that we just looked at, there was a vast space of inputs, and some small part of it constituted the actual input domain for the software under test. What happened was that when we generated invalid inputs, which was extremely easy to do, they exercised a boring part of the software under test and got mapped to input-rejection errors. And those rejections of invalid input basically just forced us to generate another value, submit it, and see what happens.
So what's going to happen is the random testing loop is going to spend a lot of time spinning. That is to say, stressing on only a small part of the software under test. But there's actually something different that can happen that's much worse than that.
So the other thing that can happen is that the input isn't rejected. The reason an invalid input might not be rejected is that the software under test fails to contain sufficient validity-checking logic to distinguish valid inputs from invalid ones. And what we get now is something potentially much worse: strange misbehavior, crashes, or something else very bad. The problem is that when we get misbehavior, crashes, and other bad behavior from the software under test, instead of the inputs simply being rejected and our random-testing loop proceeding, these behaviors look like bugs.
So normally, if we crash the software under test, this is something that has to be tracked down manually, and now we're taking a developer's time. And once we start taking a human's time, what we're essentially doing is mounting a denial-of-service attack on whoever is processing the bugs that come out of the random-testing loop. After looking at about 10 bugs that are caused by invalid inputs to the software under test, the person triaging them is going to get tired of it and isn't going to look at them any more, and now we have a problem: a random-testing loop producing no actionable output. And so our random-testing effort has basically failed.
And so let's talk a little bit about why software under test would lack a validity checker. Well, as we saw, I believe in Lesson 1, internal interfaces, that is to say, interfaces within a product and interfaces that don't span trust boundaries, often lack full validity checking. They lack it for performance reasons, for code-maintainability reasons, and basically because we have to trust somebody sometimes; otherwise, we can't get any software written. Another reason is that there exists software for which it is impossible to construct a validity checker. Earlier in this lesson, I was talking about the example of using Csmith.
That is to say, my random test generator that generates C programs for testing compilers and it turns out that C compilers don't contain a good validity checker for C. And the reasons for that are kind of involved but the quick answer is that if you have a C program it's undecidable whether that program dereferences a null pointer, accesses an out-of-bounds array element, or does something else that violates the rules of C.
So compilers do perform quite a lot of validity checking, that is to say, they look for syntactically invalid C. What they fail to do is check for the dynamic validity properties that test programs need in order to expose certain kinds of miscompilation bugs. So, in summary, the input validity problem bites us as a performance problem almost every time we write a random tester, and it bites us in a much more serious way when the software under test lacks a reliable validity checker.
So now that you've seen how easy it is to build a random tester, I hope that you build many of them in the future and most of the rest of this course is going to be spent going into more details about how we build effective random testers. But before we do that, what I'd like to do is address an issue that eventually confronts almost everybody who works on random testing, which is that random testing gets no respect. And what I mean by that is people often think that it's a really stupid way to test software.
For example, one of the classic references on software testing is a book called The Art of Software Testing. In this book, the author, Glenford Myers, talks about random testing, and here's what he has to say: "Probably the poorest methodology of all is random input testing... What we look for is a set of thought processes that allow one to select a set of test data more intelligently." So basically what Myers is saying is that random input testing is not a way that allows one to select a set of test data intelligently.
And so, in my opinion, and this is after writing at least a dozen random testers, what he should have said is something like: what we look for in testing is a set of thought processes that allow one to randomly generate a set of test data more intelligently. We can't get around the fact that testing requires thought about the software under test. But my experience, and the experience of a lot of other people, is that thinking hard about a random test case generator is as good as, or in many cases better than, thinking hard about generating the test cases themselves.
So another classic piece of work is a book chapter called Random Testing by Richard Hamlet. And in this chapter he says, "Most criticism of random testing is really objection to misapplication of the method using inappropriate input distribution." And so what Hamlet is really talking about here is the input validity problem that we've just been discussing and what he means is that if you ignore the input validity problem and you just test randomly using completely random garbage, you're going to get a bad impression of the method because it's not going to work very well.
You're going to experience the phenomenon we talked about a little bit ago, where all of the test cases get rejected by very early parts of the software under test. And so I would actually put what Hamlet said a little differently: most criticism of random testing is really objection to random testing done badly.
So let's take a look at one more quote. Here's what the authors of Testing for Zero Bugs have to say: "The introduction of random testing practically eliminated user bug reports on released back ends. To our amazement, RIG (and this is their test case generator) was able to find over half of all bugs reported by customers on the code generator in just one night of run time." This was clearly written by somebody who did a really good job creating a random tester, and who probably also applied it in a domain where random testing happens to work really well. The rest of Testing for Zero Bugs is worth reading as well, and we will include a link to it in the supplemental material for the course.
Now I'd like to talk a little bit about how random testing differs from fuzzing, and the short answer is that they're the same thing. The long answer is going to require a bit of explanation. So let's go back in time to 1990, when a professor named Bart Miller and his students published a paper called An Empirical Study of the Reliability of Unix Utilities. What they did as part of this fuzzing effort was provide completely random data to a bunch of Unix command-line utilities: things like editors, terminal programs, text-processing utilities, and other similar Unix tools, which you can basically think of as predating the era of graphical user interfaces on Unix systems.
And what they found was that using this incredibly simple technique, that is, doing random testing without worrying at all about the input validity problem, they were able to crash a quarter to a third of these utilities on basically any version of Unix that they tested. And so what you have here is a pretty strong result: they were able to crash lots of applications with minimal effort.
What that means is that the quality of the input validation done by these sorts of programs at the time was really rather bad. A few years later in 1995, the same group repeated the effort and wrote another paper about this. This time they not only tested the same kind of utilities that they tested 5 years earlier but they extended the work to testing network applications and GUI applications and basically they got very similar results.
Then, another 5 years later in 2000, the same people did another study, and this time they fuzzed Windows applications. What they found was basically more of the same: they could crash most of the applications that they tested. And then finally, in 2006, the most recent installment of this group's fuzzing studies was published. This time they attacked Mac OS X, and this time they found something a little bit different. The command-line utilities on Mac OS X would hardly ever crash; they found a much lower rate of crashes than they had found earlier. But on the other hand, of the 30 GUI applications that they tested, 22 could be crashed.
It's worth mentioning that as this group evolved their fuzzing work, they kept having to write new tools. For example, to fuzz the Windows applications they had to deliver Windows events to GUI applications, and they had to do something similar for Mac OS and, previously, for X Windows applications. So they had to keep evolving their tools, but the input-generation methodology that they used, that is to say, basically generating random garbage and not really worrying about the input validity problem, remained the same across all of these studies.
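In the spirit of those studies, a minimal Miller-style fuzzer might look something like this sketch. The target utility and the buffer size are arbitrary choices on my part; on Unix, a negative return code means the process was killed by a signal, which usually indicates a crash.

```python
import random
import subprocess

def fuzz_once(target, max_len=10_000):
    # Build a buffer of completely random bytes -- paying no attention
    # to the input validity problem, exactly as in the original studies.
    length = random.randrange(1, max_len)
    data = bytes(random.randrange(256) for _ in range(length))
    try:
        result = subprocess.run([target], input=data,
                                capture_output=True, timeout=5)
    except subprocess.TimeoutExpired:
        return None            # a hang is also worth investigating
    return result.returncode   # negative: killed by a signal (a crash)

# e.g., run many trials and save any input that crashes the target:
# for i in range(1000):
#     rc = fuzz_once("some_utility")
#     if rc is not None and rc < 0: ...  # save the test case
```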
So what I've covered so far was this particular random testing effort by this one research group. But something interesting happened, I believe sometime around 2000 or a little after: the term fuzzing took on another connotation. In this connotation, fuzzing basically means penetration testing, and what I mean by that is that fuzzing is used for a very specific purpose, which is finding security vulnerabilities in applications. And usually the connotation is that the fuzzing is being done on applications that the person doing the fuzzing didn't write. So, for example, what we might do is find some sort of machine on the Internet that offers a service, and we'll do random testing of that service over the Internet with the goal of finding bugs in the service that are going to let us mount an attack, such as a denial of service, or find some sort of exploitable vulnerability that will let us mount an intrusion on that Internet service.
An interesting thing about this newer connotation of fuzzing is that it's really gotten rather big. For example, plenty of special-purpose tools are available, both commercial and open source, for performing fuzzing over the network. The best of these contain a lot of canned knowledge about protocols, which allows even relatively unskilled users to find bugs in services that are available over the network.
This course is mostly not going to be about this particular kind of fuzzing. Rather, the emphasis of the course is more on random testing in the original sense of fuzzing which is to say trying to find flaws in software rather than this very specific meaning of fuzzing for penetration testing purposes.
Now, one of the challenge problems for this lesson is going to involve writing something more like this kind of fuzzer, but mostly we're concerned with random testing for general software robustness. Mainly we're concerned with random testing as it would be applied to our own software or to software developed by other people in our own organization.
Alternate history is a genre of fiction where we explore what would have happened to the world had some historical event turned out differently. For example, we might have an alternate history now exploring what would have happened if the Allies hadn't won World War II.
What we're going to do is look at a few alternate history examples where we're going to try to explore the question, "What would've happened if certain groups of people in the past had used random testing in addition to whatever other kinds of testing they were doing?" And we're going to sort of specifically focus on the question, "Would random testing have been likely to alter the outcome of some famous software bugs or would it have not made much difference? Would it have not been a very useful technique?"
And so the first case study we're going to look at is the FDIV bug in Intel Pentium chips. In 1994, it came to the world's attention that the FDIV instruction supported by Intel's Pentium processor was flawed. FDIV is an instruction that you would use to perform floating point division: if we're writing assembly language for a Pentium, we issue the FDIV instruction with two operands, a and b, and it computes the quotient of a divided by b.
And so what happened is there's a bug in FDIV. And what had happened was the implementation of FDIV, that is to say the hardware implementation of this machine instruction was intended to be very fast. The Pentium was, at that time, a very high performance processor. And the way FDIV was fast, at least in part, was because it did part of its work using table lookups. So there was this table of values stored in the Pentium's hardware.
What had happened was that somewhere on the way to fabrication, some of the values had not been loaded correctly into the table. So part of the lookup table, I believe it was 5 entries, contained the wrong values, and for some values of a and b passed to the FDIV instruction, looking up these wrong entries would cause it to produce the wrong result. This wasn't an extremely major error in the sense that it would return a million instead of 0; rather, the result was off some number of places after the decimal point.
The reason that the flaw in FDIV went undetected for some time was that only about 1 in 9 billion randomly selected inputs would actually trigger the bug. And so now the question we need to ask ourselves is: given this relatively low observed failure rate on random inputs, would random testing have been a good way to find the Pentium FDIV bug? The answer is almost certainly yes. So what we're going to do now is try to figure out about how long it would have taken Intel to find this bug using random testing.
And so the first thing we need to do is make an assumption about how many tests per second they could run. It's not as if in 1994 they had modern multi-gigahertz hardware available; rather, these Pentiums ran at around 60 MHz. So what we're going to say, for the sake of argument, is that it took 10 µs to perform an FDIV and verify its result. In other words, at 60 MHz, I'm assuming that it would take about 600 cycles to check a particular input to FDIV. Now let's try to work out the math.
All right. Let's try to work out the math. So what we have is, by assumption, we can run 1 test every 10 microseconds. Now, there are a million microseconds in a second, 60 seconds in a minute, and 60 minutes in an hour, and 24 hours in a day.
So now we're going to do some unit cancellation. We can kill microseconds, we can kill minutes, we can kill seconds, and we can kill hours. So if we do the multiplication, we can get tests per day. And if we do that, we get 8,640,000,000 tests per day. If we multiply this testing throughput by the failure rate we're going to get 1 failure per 9 billion tests. We can cancel tests, do the division, and arrive at 0.96 expected failures per day.
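The same unit-cancellation arithmetic can be written out as a quick sanity check, using the assumptions stated above:

```python
# Assumption from above: one FDIV test every 10 microseconds,
# so 100,000 tests per second.
tests_per_second = 1_000_000 // 10
tests_per_day = tests_per_second * 60 * 60 * 24
print(tests_per_day)            # 8,640,000,000 tests per day

# Observed failure rate: about 1 failure per 9 billion random inputs.
failure_rate = 1 / 9_000_000_000
failures_per_day = tests_per_day * failure_rate
print(failures_per_day)         # about 0.96 expected failures per day
```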
So under what I think are fairly modest assumptions here, if we perform completely random testing of the input space for FDIV, we should be able to find this bug in about a day. Now, this kind of testing is going to need some sort of an oracle; we need a way to tell whether a particular output from FDIV is correct or not. And the way this works is that IEEE floating point, which is what FDIV is implementing, is specified very tightly. That is to say, one implementation of IEEE floating point division has to return the same bit pattern as another implementation.
That's one of the nice things about that particular specification: it's fairly tight. So we ask ourselves, what would have been the oracle for FDIV? Probably it would have been Intel's existing 487 floating point unit, which had been around for some years by the time they were developing the Pentium.
So what I think this shows, unless I've messed up egregiously on the math somewhere, is that random testing would have been a perfectly good way to find the Intel Pentium FDIV flaw, presuming, of course, that Intel could have found a way to rig up a Pentium in concert with a 487 in such a way that the two could cooperate to do the testing.
So now let's talk about the 1988 Internet worm. There are several interesting things about this Internet worm. Probably the main one is that it was one of the first worms that actually got widespread attention. It got this attention for good reason.
If you remember 1988, the Internet was not particularly well known to the general public, and it had a relatively small number of users. And even so, this worm infected an estimated 6,000 machines. And while this is a really tiny number compared to a modern worm or a modern botnet or something like this, this was a substantial fraction of the number of machines connected to the Internet at the time.
The way this worm spread is it used computers' Internet connections to exploit known vulnerabilities in the UNIX hosts of the time. Of course, at the time, the existence of a remotely exploitable bug wasn't considered nearly as serious as it would be considered today because, of course, the 1988 worm and all of the subsequent ones hadn't happened yet.
One of these bugs was a buffer overflow exploit in the finger daemon. This was a service that ran on UNIX machines of the time; it let you query a remote machine to learn whether a user was logged in to that machine, and some other information. So now let's ask the question: would random testing have changed the outcome? It seems extremely likely that it would not have, because these bugs were already known at the time.
On the other hand, let's ask a little bit different question. Could this bug in finger daemon and lots of other bugs like it have been found by random testing? And the answer to the question is probably yes.
In fact, if we go back to the original fuzzing paper, one of the bugs that was found was caused by the same programming practice that provided one of the security holes to the Internet worm. So basically, even in its original fairly weak form where fuzzing was done with completely random data, it was finding the kind of bugs that were causing security holes. This remains true to this day, that fuzzers are used to find a lot of exploitable vulnerabilities in hosts that have Internet-facing services.
So in summary, random testing could have found the kinds of bugs that the worm exploited, and others like them, had people been running fuzzers a couple of years earlier.
What we've been doing in this lesson is looking at a progression of random testers, from the simpler ones to the more complicated ones. And as we've seen, one of the major things creating difficulties for us is that structure is required in inputs.
What we're going to do now is look at the next level of structured inputs, and that level of structure is required for random testing of APIs. If you remember from our pictures earlier in the course, we have some software under test, and it's providing an API or several APIs for other code to use. These are the kind of APIs that we're going to focus on testing. An API is basically just a collection of function calls that can be invoked. What we're going to find when we test APIs is that a single random test is not going to be something extremely simple like a credit card number or a couple of inputs to a single API call; rather, a single random test is going to constitute a string of API calls.
There are a couple of things that make this situation more complicated than what we've seen so far. The first possibility is that there are dependencies among API calls. What this means is there might be certain orderings of API calls, perhaps even combined with dependencies on the arguments to those API calls, that are either illegal or undesirable, and we're going to need to teach our random tester to avoid those.
So let's just for the moment take the example of randomly testing a file system. A file system is a piece of code that takes the raw abstraction exported by a hard disk driver and builds something more useful on top of it. So here's the disk itself. The disk talks to a driver: a piece of code running in the operating system that understands the details of interacting with whatever particular kind of hard drive we're dealing with, and what it exports is the abstraction of a block array. A block array just means that, from the point of view of the file system, the hard disk looks like an array of blocks numbered 0, 1, 2, all the way up to some very large number for a modern hard drive. The purpose of the file system is to take this low-level disk block array interface and build on top of it a hierarchical namespace--that is to say, a directory structure as provided by file systems in UNIX or Windows--where directories contain files, and files are just variable-sized bags of bits that we can create and delete dynamically. The file system has to manage all of that structure, manage the free space, and efficiently respond to all sorts of calls that perform file system operations.
Let's look at what the file system operations are. The question is, what are the contents of the file system API? We have things like mount and unmount, and these are the calls that invoke the file system code in the first place. So if a file system is not mounted, it's just sitting there on disk, not being used at all. Once it's mounted, its contents are available to the operating system. We have calls like mkdir, which creates a directory, and rmdir, which deletes one. Open can be used to create a new file or open an existing one; unlink deletes a file. I'm talking about the UNIX-style file system interface here. And then there are a bunch more calls. So if we want to do random testing of a file system, we're going to be issuing sequences of calls in the file system API.
Before we get into the details, let's just think about why we might want to do random testing of a file system. First of all, file systems are fairly long and complicated pieces of code; I looked at the size of 5 of the file systems commonly used on Linux, and they're all between 15,000 and about 70,000 lines of code. File systems end up being substantial chunks of code, and what's more, the integrity of our data depends on the correctness of that file system code. So for example, if we save some critical data to the disk and the file system messes it up--that is to say, it saves it in some wrong place or corrupts it in some other way--then we're never going to get that data back. And so it's really, really important that file systems work well.
Now let's look at these calls and try to figure out what the dependencies are among them. First and most obviously, nothing is going to succeed until we've mounted the file system. Similarly, no file system call will succeed after we've unmounted it. So we want to be careful with mount and unmount. We don't want to just freely mix those into our stream of random API calls that we're making because if we do that, we're going to be mounting and unmounting the thing so often that the file system code is never going to get into the kind of steady state that we're going to need it to be in in order to do effective random testing of it.
Similarly, some of the other calls in a file system, like reading and writing to a file, are only going to succeed if the file has been opened. And so if we do something silly like providing completely random arguments to open, read, and write, then the odds of us actually successfully reading or writing an open file are going to be extremely low. So we're going to have to do something more systematic: probably something like open a file and keep track of the fact that it's open, so we can randomly generate read and write calls on it. There are probably some other dependencies as well.
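To make this concrete, here's a minimal sketch of a dependency-tracking random tester for a handful of file system calls. It runs against a scratch directory rather than a real mounted file system, and the specific calls and bookkeeping are illustrative assumptions, not the interface of any particular tester:

```python
import os
import random
import shutil
import tempfile

# A toy random tester for a few file system calls. It tracks which
# directories and files it has created so that most generated calls are
# legal, mirroring the dependency tracking described above. The sandbox
# directory stands in for a freshly mounted file system.
random.seed(1)                        # seed makes the run repeatable
root = tempfile.mkdtemp()
dirs, files = [root], []
counter = 0                           # unique names avoid collisions

for _ in range(200):
    op = random.choice(['mkdir', 'create', 'unlink', 'rmdir'])
    counter += 1
    if op == 'mkdir':
        d = os.path.join(random.choice(dirs), 'd%d' % counter)
        os.mkdir(d)
        dirs.append(d)
    elif op == 'create':
        f = os.path.join(random.choice(dirs), 'f%d' % counter)
        open(f, 'w').close()
        files.append(f)
    elif op == 'unlink' and files:
        os.unlink(files.pop(random.randrange(len(files))))
    elif op == 'rmdir' and len(dirs) > 1:
        d = dirs[-1]
        if not os.listdir(d):         # rmdir only succeeds on empty dirs
            os.rmdir(d)
            dirs.remove(d)

# a cheap oracle: our model of the tree must match the real tree
assert all(os.path.isdir(d) for d in dirs)
assert all(os.path.isfile(f) for f in files)
shutil.rmtree(root)
```

The final assertions are the oracle: the tester's model of the directory tree must agree with what the file system actually contains.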
What we're seeing is that if we want to do effective random testing of a file system, we're going to need to track these dependencies, at least in some rudimentary fashion, in order to issue a sequence of API calls that does reasonably effective random testing of a file system. The second issue that starts to come up as we perform random testing of APIs is that our tests--that is to say, a sequence of file system operations in this example--are going to become quite large, and that raises the question: how are those tests represented?
In the examples we've seen so far, test cases haven't had any sort of official existence of their own. What we've rather done is mix the software under test and the test driver code in a single piece of Python code. So now if we do this, what's the actual form taken by the test cases? And the answer has to be that those test cases only exist as ephemeral objects in memory. For most testing purposes, that's fine. The driver code creates the test cases, passes them to the software under test, looks at the results, and keeps going until it finishes.
There are a couple cases in which that's not so good. One of them is where we find a particular test case that makes our system fail and we'd like to save off that test case for later use in regression testing. And so what we end up having to do to make that work is writing some separate code which also lives in the Python program to load and save test cases. And so now this module is going to be doing file IO in order to load and save test cases, and most likely those test cases are going to be saved as text files or something. So for example, if we wanted to represent a sequence of file system calls as a text file, we could just have the word mkdir followed by whatever the arguments to mkdir are, and then the loading and saving code would need to either parse those files in order to create a test case in memory or unparse them-- that is to say, pretty print them to disk in order to save a test case.
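A minimal sketch of such load and save routines might look like this; the (call, args) representation and the space-separated text format are invented for illustration, and the toy format assumes arguments contain no spaces:

```python
import os
import tempfile

# Hypothetical save/load helpers for a test case represented as a list of
# (call, args) pairs, written one call per line as the text above describes
# ("mkdir" followed by its arguments). Arguments must not contain spaces.
def save_test(ops, path):
    with open(path, 'w') as f:
        for call, args in ops:
            f.write(call + ' ' + ' '.join(args) + '\n')

def load_test(path):
    ops = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            ops.append((parts[0], parts[1:]))   # un-parse into (call, args)
    return ops

case = [('mkdir', ['/a']), ('open', ['/a/f']), ('unlink', ['/a/f'])]
path = os.path.join(tempfile.mkdtemp(), 'case.txt')
save_test(case, path)
assert load_test(path) == case    # round-trip preserves the test case
```

Once test cases are first-class files like this, saving a failing case for regression testing is just a matter of keeping the file around.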
There are a couple of advantages of doing this. First of all, as I was saying, it facilitates regression testing. And the second reason is that it's often the case that when random tests become extremely large, we need to turn them into first class objects--that is to say, objects living on disk using some sort of a save and load routine--in order to perform test case reduction. Test case reduction is a piece of technology that's very often combined with random testing, and what it does is take a large test case that makes the software under test fail and turn it into a smaller test case that still makes the software under test fail. We'll look at test case reduction in more detail later on in this course. Now what I'd like to do...
...is investigate in more detail random testing of a specific API, and that's the bounded queue that we've already looked at a couple times in this class.
Let's take a minute and review the code for the bounded queue. The queue is an abstract data type providing several methods: a constructor; empty and full calls; enqueue, which adds something; dequeue, which removes something; and checkRep, our assertion checker that does sanity checking over the internal state of the queue. So now let's look at the simple test routine that I wrote. It creates a queue of 2 elements, adds some stuff to it, asserts that the proper values are returned, and removes those things from the queue, again asserting that the proper things were returned. And notice also that we're calling checkRep after every queue operation.
So this is a non-random test for the queue. What I'd like to do now is, using the Udacity API, write a random tester for the queue. So let's just talk about this a little bit--how should it work? It's clear that the random test, like the deterministic test here, is going to need to start off by creating a queue. It's also clear that after performing any queue operation, the random test should call checkRep. The thing that's not clear is how do we decide what API function to call, and how do we know whether the result is correct? One thing you can do is randomly call functions in the queue's API with random arguments and just keep asserting checkRep and not really care what comes out. That is to say, you can just randomly invoke queue operations in a loop and wait for checkRep to fail.
On the other hand, it's not that hard to do better. The first thing is we know how many elements the queue holds, and we also know whether any particular addition or removal from the queue succeeds or fails. And so it's pretty easy using an integer to keep track of how many elements are currently used in the queue. And what we'd like to do in that case is assert that any enqueue operation to a non-full queue succeeds and any enqueue operation to a full queue fails. Similarly, any dequeue operation from an empty queue needs to fail, as we see a little bit below, and any dequeue operation from a non-empty queue should succeed.
The third thing you can do--and you don't need to do this to pass the quiz, but it's something that would be nice-- is actually keep track of what values should be coming out of dequeue operations. And so let's just take a minute to see what that would look like. The bounded queue that you're testing is based on a Python array, but it turns out that it's really easy to emulate the operation of that queue using native Python data structures.
Basically, if you wanted to do this, you would declare a Python list that's initially empty, and every time we enqueue something in our bounded queue, we also append it to the list. We have to be a little bit careful here, because if the enqueueing of an item fails, we don't want to append that item to our list. Similarly, when we dequeue something from the bounded queue, if the dequeue operation succeeds, we issue l.pop(0).
The way this works is when something gets appended to a list, it gets appended at the end, and l.pop(0) takes something off the beginning. So basically, we're emulating a queue using native Python data structures, and our emulated queue isn't going to be as fast as the native queue which is based on the array, but that doesn't matter for testing purposes. And so if we do these kind of operations and we do them in a random testing loop and we insert the appropriate checkReps, then we can build really a quite effective random tester for the bounded queue using not that many lines of code. So what I'd like you to do now is go to the API and write a small random tester for the queue.
Now let's really quickly look at mine. Yours doesn't need to look much like mine at all, as long as it accomplishes something similar. What we're going to do here is set N, the size of the queue, to 4, create a new queue of that size, and immediately call checkRep, because basically, every time we do a queue operation, we'd like to do a checkRep, so that if anything goes wrong with the queue, we detect it as early as possible. We also initialize a list l to be empty. This l and the q are always going to contain the same elements in the same order, so l and q always have the same contents. Now we're going to do 10,000 random tests, and each test is randomly either an enqueue or a dequeue. If it's an enqueue, we enqueue a random integer. After enqueueing, we call checkRep, and if the enqueue succeeded, we append the item that we enqueued to the list that mirrors the queue's contents and also increment a variable counting how many times this operation succeeded. On the other hand, if the enqueue doesn't succeed, we assert that the length of our list is equal to N--that is to say, the list has the same number of elements as a full queue--assert that the queue is full, call checkRep again, and also increment a counter indicating that we performed an add operation on a full queue. We're not going to talk about why these counters matter just yet; we'll look at that a little later. The dequeue case is exactly analogous. With 50% probability we issue a dequeue operation, call checkRep, and then, if the dequeue failed, do the analogous empty-checking operations. If the dequeue succeeded, we pop an element off our list and also ensure that it's exactly the value that we expect.
So once that loop terminates, we're going to go into a loop whose purpose is to empty out the elements of the queue and finally, after we've drained the queue of all elements, we're going to assert that our list which mirrors the queue is also empty. So that's my little random tester for the queue.
def test():
    N = 4
    add = 0
    remove = 0
    addFull = 0
    removeEmpty = 0
    q = Queue(N)
    q.checkRep()
    l = []  # l and q are always going to have the same contents
    for i in range(10000):
        if random.random() < 0.5:
            z = random.randint(0, 1000000)
            res = q.enqueue(z)
            q.checkRep()
            if res:
                l.append(z)
                add += 1
            else:
                assert len(l) == N
                assert q.full()
                q.checkRep()
                addFull += 1
        else:
            dequeued = q.dequeue()
            q.checkRep()
            if dequeued is None:
                assert len(l) == 0
                assert q.empty()
                q.checkRep()
                removeEmpty += 1
            else:
                expected_value = l.pop(0)
                assert dequeued == expected_value
                remove += 1
    while True:
        res = q.dequeue()
        q.checkRep()
        if res is None:
            break
        z = l.pop(0)
        assert z == res
    assert len(l) == 0
    print "adds: " + str(add)
    print "adds to a full queue: " + str(addFull)
    print "removes: " + str(remove)
    print "removes from an empty queue: " + str(removeEmpty)
What I'd like to talk about now is a little bit more detail on some of the different ways that we have to generate random inputs to our software under test. So let's start off with our diagram showing the random test case generator.
If you recall from the diagram a little bit earlier in this lesson, this test case generator has 2 inputs. It takes pseudo-random numbers from a pseudo-random number generator, and it's also predicated on some amount of knowledge of the input domain, and that's supplied by a human. We already looked at one example--testing the adder--where essentially no domain knowledge was needed. That is to say, the random test case generator pretty much took its pseudo-random numbers and used them directly as test cases. We looked at a different example--the credit card number generator--where pseudo-random numbers were part of the input, but they weren't directly parroted to the output; rather, they parameterized an algorithm which created valid credit card numbers. And of course it's possible to carry that kind of logic farther, as in the example where we're testing a file system and we need to generate valid sequences of API calls, or testing a web browser where we need to actually generate well-formed HTML in order to test certain parts of the browser.
But it turns out all of these ways of generating input are variations on a single theme, which we call generative random testing. Generative random testing simply means that inputs are created from scratch. There's an entirely different approach called mutation-based random testing, where inputs are created by randomly modifying existing, non-randomly created inputs to the software under test. That changes our diagram a little bit, so that we have a separate source of input: the existing test case. One thing we might ask ourselves is, which approach is better? As far as I know, there is no cut-and-dried answer to that. Sometimes generative testing works really well; sometimes mutation-based testing works really well, and it really just depends on the situation and how much time you have to spend. In general, my view is that generative testing might be a little bit better at ferreting out really weird errors, but it's a lot more work to create entire random tests from scratch--at least if there are sophisticated constraints on their form. Mutation-based testing, on the other hand, has a different strength, so let's take a look at that.
So here we have the figure that we know and love of the input domain for some software under test. If we think about what the generative random tester is going to do, it's probably going to produce test cases clustered in some part of the input domain, because it's very hard to cover the actual full space of inputs. But in any case, they're going to be spread out over some part of the region.
What a mutation-based random tester is going to do is start with some known input, and what it's going to do is randomly modify it, and so it's going to end up with test cases that in some sense are kind of in the same neighborhood as the original input. And so our mutation-based random tester is going to be able to access points like this. If we start with a different input, it's going to be able to get points like this. And so what's happening is we're exploring interesting parts of the input domain, and it could be that we could have never reached this part of the input domain using any kind of a generative random test case generator that we have time to make.
So this approach is extremely useful and extremely valid. But on the other hand, it's generally limited to exploring sort of a region of the input space that's close to the starting point. One thing that you should be asking yourself right now is, what are the operators that we use to mutate test cases to create new random test cases? And so let's look at that a little bit.
The most obvious thing to do is start flipping bits. So for example, we take a Word document that we actually saved out of Microsoft Word--maybe it's a couple megabytes long, some huge file that we've created--and we flip maybe 10, maybe 100, maybe a couple thousand bits in it, load it up in Microsoft Word, and see if we've found an input that causes Microsoft Word to crash.
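A sketch of that bit-flipping mutation might look like the following; the helper name and the fake document are invented for the example, and flipping an odd number of bits guarantees the mutant differs from the original:

```python
import random

def flip_bits(data, nflips, seed):
    # flip nflips randomly chosen bits in a copy of data
    rng = random.Random(seed)        # seeded, so the mutant is reproducible
    buf = bytearray(data)
    for _ in range(nflips):
        pos = rng.randrange(len(buf))
        buf[pos] ^= 1 << rng.randrange(8)
    return bytes(buf)

# pretend this is the multi-megabyte Word document from the text
original = b'%DOC fake document contents for illustration ' * 100
mutant = flip_bits(original, 11, seed=7)   # odd flip count: mutant must differ
assert len(mutant) == len(original)
assert mutant != original
```

Because the mutation is driven by a seeded generator, any mutant that crashes the application can be regenerated later from just the seed and the original file.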
Another thing we can do--and this is one of the techniques often used by penetration testing tools based on fuzzing-- is modify selected fields. And the model here is that our test case has some known structure-- that is to say, it's a valid HTTP request. What we're going to do is target parts of the HTTP request that are known to frequently lead to vulnerabilities in servers and we're going to selectively modify those. So we might, for example, take the request size and just replace it with some random number and see if that triggers a buffer overflow in the web server.
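As an illustrative sketch, targeting the Content-Length field of a made-up request might look like this (the request and the helper are hypothetical, not taken from any real fuzzing tool):

```python
import random

# a known well-formed HTTP request, invented for the example
request = (b"POST /login HTTP/1.1\r\n"
           b"Host: example.com\r\n"
           b"Content-Length: 11\r\n"
           b"\r\n"
           b"user=nobody")

def mutate_content_length(req, rng):
    lines = req.split(b"\r\n")
    for i, line in enumerate(lines):
        if line.lower().startswith(b"content-length:"):
            # swap in a random, possibly absurd, size
            lines[i] = b"Content-Length: %d" % rng.randrange(2 ** 32)
    return b"\r\n".join(lines)

mutant = mutate_content_length(request, random.Random(9))
assert mutant.startswith(b"POST /login")   # rest of the request is intact
```

The point is that everything except the targeted field stays well-formed, so the request gets deep into the server's parsing code before the bad value is seen.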
Another thing we can do if we have a parser for our test case is modify it using its grammar. So we can, for example, add or remove or replace tokens in a test case or also subtrees of the abstract syntax tree. So let's finish up by looking at a short mutational fuzzer. What I have here is a 5-line Python program that was made kind of famous by Charlie Miller's talk, "Babysitting an Army of Monkeys." This talk was pretty fun to watch, so I recommend that you Google that and look at it on YouTube.
What Charlie Miller claims in this talk is that he found an enormous number of bugs with this 5 lines of Python. And so what it turns out is that the 5 lines of Python are missing quite a bit of code that you need to make this work in practice, and I've added comments sort of explaining what these are. So what we would first need to do is load a file into a buffer in memory. And so the file that we're going to load is going to be a PDF document, a PowerPoint document, a Word document--whatever it is that we want to mutate for purposes of creating a random test case.
What this code does is first assign to this numWrites variable, which basically decides how many places inside the file that we've loaded we're going to mutate. And so now we're going to loop over that range, and for every iteration of the loop we're going to make up a random byte, then make up a random location within our buffer, and then mod whatever value was there with our random byte. So what does that mean? We're basically picking some totally random places in the buffer to mess with, and messing with them. Then we're going to save the buffer back to disk, run whatever our application is--Windows Media Player, PowerPoint, Acrobat Reader, whatever it is we're trying to fuzz--and look at its exit code. The exit code from the operating system is going to tell us whether the application died or not. If it doesn't die, then we're going to have to wait a little bit and then just kill it; that's a failed test case. If it does die, then we're going to want to save the buffer off to some location where we can examine it later, and in either case we then start over.
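Filled out, that workflow might look like the sketch below. Since we can't ship Acrobat or PowerPoint with these notes, the "application" is a stand-in: a tiny Python victim that exits nonzero when its input contains a byte above 0x7f, so that some mutants actually "crash". Everything else--load, mutate numWrites random bytes, save, run, triage by exit code, keep failing inputs--follows the structure just described:

```python
import os
import random
import subprocess
import sys
import tempfile

# stand-in for Acrobat/PowerPoint: exits nonzero if the input file
# contains any byte above 0x7f (an invented "crash" condition)
VICTIM = ("import sys; data = open(sys.argv[1], 'rb').read(); "
          "sys.exit(1 if any(b > 0x7f for b in data) else 0)")

seed_file = b'a perfectly ordinary ASCII document ' * 4
random.seed(3)
crashes = []
for trial in range(50):
    buf = bytearray(seed_file)           # load the file into a buffer
    num_writes = random.randrange(1, 8)  # how many bytes to clobber
    for _ in range(num_writes):
        buf[random.randrange(len(buf))] = random.randrange(256)
    fd, path = tempfile.mkstemp()        # save the mutant back to disk
    with os.fdopen(fd, 'wb') as f:
        f.write(bytes(buf))
    # run the application and triage by exit code
    rc = subprocess.run([sys.executable, '-c', VICTIM, path]).returncode
    if rc != 0:
        crashes.append(path)             # keep the failing input for later
    else:
        os.unlink(path)
print(len(crashes), "crashing inputs saved")
```

A real harness would also need the timeout-and-kill step for hung applications, which this sketch omits.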
So hopefully what would happen if we made this code real by writing the rest of it-- and this is going to be a challenge program for your assignment at the end of this lesson-- is we would basically have a large pool of documents or files or whatever that we can use for fuzzing. We'd start this thing up on some sort of a fast machine-- ideally, we'd start up a copy for every core on the machine-- and we'd go on vacation, we'd take a week. And when we came back, ideally, it would have found a bunch of vulnerabilities. It turns out that a lot of people have been using this sort of tool on the common utility programs like Acrobat Reader and PowerPoint for a number of years now. So it may be the case that if you do this, you're not going to find anything on the latest version of Acrobat Reader. And in fact, if you do find something, it's actually pretty interesting, and I hope you'll share it on the forums. On the other hand, if we want to actually get some easy successes-- and this isn't for purposes of finding real bugs; it's for purposes of understanding random testing better-- what you should do is find some old versions of Acrobat Reader or PowerPoint or whatever, get those on your machine and fuzz those. And almost certainly, using some sort of an infrastructure like this, if you wait long enough, you'll be able to find some sort of a problem in those applications.
Now we're going to talk about oracles. We really probably have deferred this topic for too long. Oracles are extremely important for random testing because if you don't have an automated oracle--that is to say, if you don't have an automated way to tell if a test case did something interesting-- then you've got nothing.
And in fact, Richard Hamlet in his sort of well-known article on random testing said that "random testing has only a specialized niche in practice because an effective oracle is seldom available." And this is something that I don't actually agree with. What we're going to see is that sometimes you have to use some imagination, but really, there are potentially quite a few oracles available and that almost always we can make something work even if it's just a weak oracle like watching out for crashes.
What I'm going to do is organize the oracles that are suitable for random testing into a collection of categories. We're going to start off with weak oracles. Weak oracles are some of the ones that are most useful in practice, but I'm calling them weak because they can only enforce fairly generic properties about the software under test. The most important weak oracle is just detecting whether or not the application crashed. What a crash usually really means is that the system under test violated some rule that the hardware imposed like, for example, memory accesses have to be aligned, or that the operating system imposed like, for example, the application isn't allowed to try to write to memory that's owned by the kernel, and in response to this violation, the operating system has decided to terminate the process in which the software under test is contained. The second most useful weak oracle is violation of a programming language rule. In Python, if you try to access an element of a list that's out of bounds and if you don't catch the resulting exception, then your Python program will get terminated. That's an example of the violation of a language rule killing an application and serving as an oracle that tells us basically that something went wrong.
Basically, almost any programming language has a number of rules like this, and the stronger those rules are, the more effective an oracle the programming language runtime can be. One of the reasons why C and C++ applications crash so much is that the language is an extremely weak enforcer of runtime rules, so basically all of the enforcement falls upon the operating system and the hardware itself.
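As a tiny illustration of a language rule serving as an oracle, the sketch below plants an off-by-one bug and lets Python's bounds checking report it during random testing; the function and its bug are invented for the example:

```python
import random

def buggy_lookup(table, i):
    return table[i + 1]        # deliberate off-by-one bug

# Python's bounds checking acts as a free oracle: the random tester
# just waits for the runtime to raise IndexError.
random.seed(0)
table = list(range(10))
caught = False
for _ in range(1000):
    try:
        buggy_lookup(table, random.randrange(10))
    except IndexError:         # the language rule violation is our signal
        caught = True
        break
assert caught
```

In C, the same bug would silently read past the end of the array, which is exactly why an enhanced execution environment like Valgrind is needed there.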
Another form of oracle comes in the form of enhanced execution environments. I've already used the example several times, but in C or C++ we could run the compiled program under a runtime monitor like Valgrind which provides an enhanced execution environment which checks a lot of rules that aren't checked by the regular C and C++ runtime or by the operating system.
For example, Valgrind will terminate our process if we access beyond the end of an array, and in practice, that's an extremely effective form of oracle. If that kind of a rule violation happens during random testing, we absolutely want to know about it, and Valgrind provides that service. Another example of an enhanced execution environment would be one that checks for, for example, integer overflow of problems or floating point problems.
So a medium oracle varies in its degree of checking power between the weak oracles we've already seen and the strong oracles that I'll talk about in a few minutes. So far in my list--and I could be missing something--I only have one example of a medium-power oracle, and that is assertion checks that the programmer has put into the software. The reason I call this a medium-power oracle is that it provides quite a lot more application-specific checking than the weak oracles do. But on the other hand, it doesn't guarantee anything even remotely close to actually correct operation, and that's generally what the strong oracles are going to do. So let's look at those.
So strong oracles are extremely useful, and you should always use one if you possibly can. We're going to have a bunch of examples here. Let's start off with one of the important ones, which is having an alternate implementation of the same specification. If you think about it, this is what my random tester for the bounded queue did, and perhaps what yours did too. What I mean is that my queue tester included a second implementation of the same queue abstraction, this one implemented with a Python list, and because it was a second implementation of the same abstraction, we could use it to check that the queue we were testing gave the right answers. That's a very strong kind of check.
Another example comes from my compiler testing work. We do what's called differential testing of compilers, which basically means that if we have multiple implementations of the same compiler specification--that is to say, for example, multiple C compilers--we expect them to behave the same given equivalent inputs. Another way we might get an alternate implementation is an older version of the software that we're testing; this checks not necessarily whether the software is correct, but rather whether we've broken it. Remember, for example, that I said Intel probably could have tested the Pentium floating point unit against the 487 floating point unit.
Another kind of old-version oracle that tends to be extremely effective comes after a refactoring change--that is to say, a change to our source code base that is not intended to have any functional effect. We can use differential testing of the versions before and after refactoring and, in that way, get a pretty good idea that the refactoring didn't actually break the code. The best kind of alternate implementation you could have is a reference implementation: some sort of implementation of the specification you're after that you can trust. For example, if I wanted to implement an extremely high-performance Python compiler, I would use the regular CPython implementation as the reference implementation, and that would be treated as correct.
The next kind of strong oracle that I'd like to talk about is what I would call a function inverse pair. We have available to us some sort of function and also its inverse, and we can use these as a pair to do strong checking of correct behavior of the software under test.
If we remember, a couple of lessons ago I gave the example of where you could test an assembler by pairing it with a disassembler and also an exhaustive enumerator for instruction encodings. And then what we would do is take all valid instruction encodings, disassemble them into assembly language, reassemble the assembly language into machine code, and do a comparison on the output. So that's what I'm talking about when I mention function inverse pairs. If we think about it, we can find these function inverse pairs a lot of places.
Other examples include encryption and decryption, compression and decompression, and saving and loading of files. And this last one isn't as trivial as it sounds. I'm not talking about a case where we take some sort of bitwise representation in memory, dump it literally to disk, and then load it again. That's unlikely to go wrong.
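The compression/decompression case is the easiest of these to turn into a working oracle, because the pair must be an exact identity on every input. Here is a minimal sketch using Python's standard `zlib` module, driven by seeded random payloads (the trial counts and sizes are arbitrary choices):

```python
import random
import zlib

def compression_roundtrip_test(trials=200, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        # Random binary payloads of varying size, including empty.
        data = rng.randbytes(rng.randrange(0, 4096))
        for level in range(0, 10):
            # Compression followed by decompression must be the identity,
            # at every compression level.
            assert zlib.decompress(zlib.compress(data, level)) == data

compression_roundtrip_test()
```

Note that this checks only the inverse property, not that the compressor produces small or even valid-looking output; a compressor that copied its input verbatim would pass. Inverse-pair oracles are strong but not complete.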
On the other hand, many times when a program saves its state to disk, this is a pretty non-trivial operation. We might, for example, be replacing machine pointers with some sort of on-disk representation, and then the load operation has to contain the inverse, and it's pretty easy to get that wrong. Or perhaps we forgot to save part of the program state, so that when we load it we end up in a different state; we can catch that by treating save and load as a function inverse pair. Transmit and receive across some sort of loopback interface, that is to say, a network interface that connects a machine to itself, can serve as a useful oracle because in some cases non-trivial transformations of data representation happen as part of the transmit and receive operations. And finally, encoding and decoding of, for example, media formats serves as a final example of a function inverse pair.
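To illustrate the pointer-replacement case, here is a small sketch (all names are mine, not from any real serializer) that saves a linked list by swizzling each object reference into an integer index, and loads it back by doing the inverse. Because the loaded nodes live at different addresses, the comparison abstracts away the pointers and looks only at the values in link order:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # an in-memory pointer (object reference)

def save(head):
    # "Swizzle" pointers: replace each reference with the index of the
    # node it points to (or -1 for None) in a flat record list.
    nodes, index = [], {}
    n = head
    while n is not None:
        index[id(n)] = len(nodes)
        nodes.append(n)
        n = n.next
    return [(n.value, index[id(n.next)] if n.next else -1) for n in nodes]

def load(records):
    # The inverse: rebuild the nodes, then turn indices back into references.
    nodes = [Node(value) for value, _ in records]
    for node, (_, nxt) in zip(nodes, records):
        node.next = nodes[nxt] if nxt != -1 else None
    return nodes[0] if nodes else None

def shape(head):
    # Abstract away the pointer values: keep only values in link order.
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

# Build the list 1 -> 2 -> 3 and check that save/load preserves its shape.
a, b, c = Node(1), Node(2), Node(3)
a.next, b.next = b, c
assert shape(load(save(a))) == shape(a) == [1, 2, 3]
```

If `save` forgot a field, or `load` wired an index to the wrong node, the shape comparison would flag it, which is exactly the kind of bug this oracle is good at catching.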
I know there must be a bunch of these that I'm missing, but these are the ones that I could come up with just sitting down and brainstorming for a couple of minutes. If you have a function inverse pair, use them together and try to see if you arrive at the original data. And of course, this may not always be as easy as it seems. For example, when we save pointers to disk and swizzle them into some sort of on-disk representation, the pointer values may have changed when we load them, but the shape of the data shouldn't. So some sort of abstraction is going to be required to compare the saved and loaded versions. Similarly, the encoding and decoding of media files often involves a lossy encoding step. So if we encode something as a JPEG and decode it, then of course the bits we get back are not going to be the bits we started with. But on the other hand, they shouldn't be too different. If we can somehow quantify how different they are, then we might be able to use even a lossy encode/decode pair as a test oracle for something like a JPEG encoder and decoder.
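A toy model of the lossy case makes the "quantify how different" idea concrete. The codec below (invented for illustration; real JPEG is far more involved) simply quantizes samples to the nearest multiple of a step size `Q`, so the decoded data differs from the original, but the error per sample is provably bounded by `Q / 2`, and that bound is what the oracle checks:

```python
import random

Q = 8  # quantization step: the lossy part of this toy codec

def encode(samples):
    # Lossy "encoding": quantize each sample to the nearest multiple of Q.
    return [round(s / Q) for s in samples]

def decode(codes):
    return [c * Q for c in codes]

def lossy_roundtrip_test(trials=100, seed=7):
    rng = random.Random(seed)
    for _ in range(trials):
        samples = [rng.randrange(256) for _ in range(1024)]
        decoded = decode(encode(samples))
        # The bits differ, but the error must be bounded: quantization
        # can move each sample by at most Q / 2.
        assert all(abs(s - d) <= Q / 2 for s, d in zip(samples, decoded))

lossy_roundtrip_test()
```

For a real image codec the per-sample bound would be replaced by an aggregate distortion metric such as peak signal-to-noise ratio, with a threshold chosen empirically, but the structure of the oracle is the same.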
The final strong oracle I want to talk about is what I'm going to call a null space transformation. This is where we take a test case (random or otherwise) and make some sort of change to it that shouldn't affect how it's treated by the software under test.
One thing we could do, just as a trivial example, is start with a simple Python function. This is the input to the null space transformation. So here we have foo of a and b, which returns the sum of a and b. One possible null space transformation on this would be to add a level of parentheses, or maybe several. This is a null space transformation because we've transformed the test case, possibly in a non-trivial way (although here we've done something pretty trivial), in such a way that we know the meaning of the test case shouldn't have changed. So now we can do something very much like differential testing, but instead of taking the same program and running it through two implementations of Python, we're going to take two programs which are semantically identical and run them through the same implementation of Python. And if that implementation of Python treats them differently, that is to say, it interprets the code in such a way that it returns a different answer, then we've found a bug in it.
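A minimal sketch of this parenthesization transformation, using Python's own `eval` as the implementation under test (the function names and trial counts here are my own choices):

```python
import random

def add_parens(expr):
    # A simple null space transformation: wrap the expression in
    # redundant parentheses. The program's meaning must not change.
    return "(" * 3 + expr + ")" * 3

def null_space_test(trials=1000, seed=3):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randrange(-100, 100), rng.randrange(-100, 100)
        expr = f"{a} + {b}"
        # Run the original and transformed programs through the same
        # implementation and require identical results.
        assert eval(expr) == eval(add_parens(expr))

null_space_test()
```

The appeal of this oracle is that it needs no second implementation and no knowledge of the correct answer: agreement between the two runs is the entire check.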
Of course, we could always do even more elaborate things. Here we're calling some sort of identity function on a, that is to say, some function guaranteed to return the same value it's called with, although of course that's not going to be apparent to the interpreter, and then instead of adding b to a, we're going to subtract -b from a. So that's another little null space transformation. I don't have any personal experience using null space transformations to find bugs, but I've heard of people doing this kind of thing, and apparently it can be useful.
So what have we done here? We've seen that in realistic situations, there are a lot of potential oracles available for performing random testing. So the conventional wisdom, at least as it's expressed in that chapter on random testing, that oracles are almost never available, is, I think, probably not right. It's often the case that if we think about it for a while, and if we're willing to invest some time in creating a good oracle, we can very often find something to use for random testing.