Welcome to lesson one of CS258 — How to Make Software Fail.
Making software fail is something I feel very strongly about. But it's not really the failures themselves that we're interested in; rather, if we can find failures in software and fix them, then eventually we're going to run out of bugs and have software that actually works reliably.
If you look at testing software as a whole — it's a very large problem — it's really kind of daunting. We can look at Microsoft's products, and we can see that Microsoft didn't manage to eliminate all the bugs in their products. We can look at Google's services, and we can see that Google didn't manage to eliminate all the bugs in their services. We might want to say to ourselves — "how can we possibly get rid of all the bugs in our own software?"
It turns out that this testing problem is not a large "monolithic" problem, but that it can be broken down into a lot of smaller sub-problems. By looking at those sub-problems, we can apply known techniques, things which people have done before, and sort of pattern match these problems to our own... Once we've become good at these smaller problems, then we can become much better testers as a whole.
And this is what Lesson 1 is about.
Now I'm going to take a quick look at the question: what is testing?
It's always the case that we have some software under test. When we're doing testing, we always require a source of test inputs which result in test outputs. The outputs are processed by an acceptability check. If the outputs are okay, then the test case succeeded. If the outputs are not okay, then the test case failed and we're going to have to debug. This, of course, sounds really simple.
On the other hand, a couple of these things end up being really hard — namely, selecting a good set of test inputs and designing good acceptability tests. Basically, these are what we're going to spend this entire course talking about.
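To make the pieces concrete, here's a minimal sketch of that structure in Python 3 syntax (the lecture's code is Python 2). The function under test, the test inputs, and the acceptability check are all illustrative choices of mine, not from the course:

```python
# Sketch of the testing structure: software under test, a source of test
# inputs, and an acceptability check. All three are illustrative choices.

def software_under_test(x):
    return abs(x)

def acceptability_check(test_input, output):
    # abs() must return a non-negative value with the same magnitude.
    return output >= 0 and output in (test_input, -test_input)

test_inputs = [0, 1, -1, 42, -42]

for test_input in test_inputs:
    output = software_under_test(test_input)
    if acceptability_check(test_input, output):
        print("input", test_input, "-> OK")
    else:
        print("input", test_input, "-> FAILED, time to debug")
```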
I would like to give a few facts about testing — by facts I mean my own opinions.
Here's an important one: testing your own software is really hard. There are several reasons for this.
In particular, we can't write effective test cases for parts of the spec that we didn't understand correctly. It's not uncommon to misunderstand things while writing software, and once that's happened, what we really have is a bug in our thinking. It's mirrored by the bug in the software, but those kinds of bugs are extremely hard for us to find ourselves. It usually takes somebody else's testing to find this kind of bug.
So I want to spend a little time talking about the question — What's going on when we test software? It turns out the answer is... a lot of different things are going on.
Let's start with the output produced by a test case, which passes through the acceptability check. When we check the output of a test case for acceptability, we're running a little experiment, and this experiment has two possible results. One result is that the output is okay, in which case we go run another test case. The question is, what do we learn in that case? The answer, unfortunately, is not very much.
At best we might have increased our confidence just a tiny, tiny bit that the software under test is correct. As it happens, we stand to learn a whole lot more when the output is not okay. What I'm going to talk about right now is what happens when the acceptability check fails — that is, when the test output is not okay — and we have to discover what's happened.
Of course what we expect is that much of the time we'll find a bug in the software under test. If that's the case, we're going to go fix it. And if not, there's still plenty of other possibilities.
One of the main possibilities is a bug in our acceptability check. If so, we're going to fix it. If not, we should consider the possibility that there might be a bug in our specification. Specification bugs are a fairly large part of debugging, because the specification often isn't written down, or isn't written down particularly formally; it's just an English document or an understanding among a group of people. And very often, we're learning what it is we needed to implement as we're implementing the software.
If that's not the case though, if the bug was not in our acceptability check, nor in the software under test, nor in the specification, then we still have some possibilities but they're getting more difficult to deal with.
Some of the worst debugging stories you hear stem from a flaw in the operating system, in the compiler, in runtime libraries, or even in the hardware. Fixing this kind of bug is often not an option for us, because we've purchased these things or otherwise gotten them from some other vendor, so they often have to be worked around, and these workarounds can be extremely painful. If that is the case, we have a hard road ahead of us, but at least we know where the flaw is. If that's not the case, we're at a bit of an impasse.
What is it that's really going on here? What's the big picture here?
A test is a tiny experiment where a failed test reveals that something is wrong: something is flawed either in our understanding of the system or in some part of the system itself, and there might be a fairly elaborate process involved in figuring out what the problem is. These kinds of discoveries about the system we're using are not necessarily a problem, as they're things we need to know if we're going to create correct software in the future.
Let's take a look at some famous bugs that fit this basic flowchart.
The Mars Climate Orbiter was sent off to Mars in 1998, and there were some miscommunications between NASA and the people they contracted out to, Lockheed Martin. By the time the Mars Climate Orbiter actually got to Mars, in 1999, there had been some problems that caused the orbiter to drift off course enough to be on a suicide mission: it veered into the Martian atmosphere, broke up, and crashed into the planet.
What happened was a basic unit error. NASA expected units in metric (for example, meters per second), and Lockheed Martin programmed in imperial units (for example, feet per second). Neither of these is wrong; both are entirely valid ways to program a spacecraft, and the underlying code was actually correct, at least in the sense that this bug didn't have anything to do with the underlying code. However, the NASA software was expecting units in meters per second while the Lockheed Martin software was sending feet per second, and that miscommunication caused the drift that sent the Mars Climate Orbiter into the Martian atmosphere and into the planet.
Is this a bug in the software under test — the actual software of the Mars Climate Orbiter?
Was it a bug in the acceptability test when they tested the actual software earlier on?
Was there a bug in the specification, was there actually an underlying bug, and how they planned on operating the Mars Climate Orbiter?
Or was there a bug in the underlying hardware, libraries, compiler, or operating system of the spacecraft? So go ahead and mark your answer here.
S. You could perhaps argue some of these, and you might have some difference of opinion here, but I wouldn't say that the bug was in the actual software under test (1), because the actual software did what it was supposed to do according to the specification.
Was there a bug in the acceptability test (2)? Well, in the sense that the acceptability test didn't check this, yes. But if you use that as the metric, then you could argue that every bug is a bug in the acceptability tests, because they didn't catch all of the bugs, and catching all of the bugs is of course infeasible.
I would say this is a bug in the specification (3), because this was a miscommunication between the engineers at NASA and Lockheed Martin. Both of their ideas were correct, but they didn't communicate those ideas properly; each side simply assumed its own units. If they had put that information explicitly in the specification, then we wouldn't have had these issues, and the Mars Climate Orbiter wouldn't have had at least that bug. Maybe something else would have cropped up, or maybe not.
I also wouldn't say that there is a bug in the underlying OS, compiler, libraries, or hardware(4). At least this bug isn't an example of that and we don't have any evidence of that.
There might have been other bugs, but the one which caused the crash was definitely a bug in the specification.
What I'm going to do now is go over a fixed-size queue data structure that's going to serve as a running example for parts of this lecture, and that we're also going to keep using for exercises over the next several lessons.
The way this data structure works is that it supports an enqueue operation and a dequeue operation, and enqueued elements are dequeued in FIFO order. FIFO means first in, first out: if we enqueue 7 and then 8, the first thing we dequeue will be 7, then 8. And if we try to dequeue again — that is, if we try to dequeue an element from an empty queue — we're going to get some sort of an error.
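Before we build our own fixed-size queue, the FIFO behavior just described can be sketched with Python's built-in collections.deque standing in for our queue (this stand-in is my illustration, not part of the course's implementation; Python 3 syntax):

```python
# The FIFO behavior described above, demonstrated with Python's built-in
# collections.deque standing in for our queue.
from collections import deque

q = deque()
q.append(7)              # enqueue 7
q.append(8)              # enqueue 8
assert q.popleft() == 7  # first in, first out: 7 comes out first
assert q.popleft() == 8  # then 8
# Dequeuing from an empty queue is an error; deque raises IndexError.
try:
    q.popleft()
    print("no error: unexpected")
except IndexError:
    print("dequeue on an empty queue raised an error, as expected")
```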
That's our fixed-size queue, and here is how we can start to implement it in code.
import array

class Queue:
    def __init__(self, size_max):
        assert size_max > 0
        self.max = size_max
        self.head = 0
        self.tail = 0
        self.size = 0
        self.data = array.array('i', range(size_max))
On the implementation side, what we have here is a Python class called Queue. The constructor for that class takes a size_max argument, which determines the largest number of objects that can be stored in our queue, and it sets up the data structure.
So first it's going to make sure that size max is greater than zero.
assert size_max > 0
It's going to save a temporary copy of that, initialize some head and tail pointers, a size variable which stores the number of objects currently in the queue, and finally we need to reserve some space for the queue elements themselves.
self.data = array.array('i', range(size_max))
What you can see here is that we're allocating a Python array. One implementation option for a queue in Python would be to just use a Python list, and that would be basically trivial: Python lists pretty much natively support enqueue and dequeue operations. The problem with Python lists is that they're dynamically allocated and dynamically typed, and that makes them not very fast.
So by allocating a fixed-size storage region of statically typed memory — where the 'i' here means that our queue is only going to be able to store integers — we can avoid some of Python's dynamic checks that make a queue slow. In some cases, a queue based on a Python list is perfectly fast; on the other hand, in some benchmarking that I did, this statically sized, statically typed queue was more than 10 times faster than a queue based on a Python list.
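As a rough illustration of that performance claim, one could time a list-based queue against a fixed array.array ring buffer. This sketch is my own, not the lecturer's benchmark; exact numbers vary by machine and Python version, and the "10x" figure above came from his measurements, not from this code:

```python
# Rough timing sketch: a plain Python list used as a queue (pop(0) shifts
# every remaining element, O(n)) versus a fixed-size, statically typed
# array.array ring buffer. Numbers are illustrative only.
import array
import timeit

N = 1000

def list_queue():
    q = []
    for i in range(N):
        q.append(i)       # enqueue
    for i in range(N):
        q.pop(0)          # dequeue from the front: O(n) each time

def array_ring():
    data = array.array('i', range(N))
    head = tail = size = 0
    for i in range(N):
        data[tail] = i            # enqueue at the tail
        tail = (tail + 1) % N     # wrap around
        size += 1
    for i in range(N):
        x = data[head]            # dequeue from the head
        head = (head + 1) % N
        size -= 1

print("list-based queue :", timeit.timeit(list_queue, number=10))
print("array ring buffer:", timeit.timeit(array_ring, number=10))
```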
The first method that queue supports is the empty method and this simply returns true if self.size equals zero
def empty(self): return self.size == 0
--so of course the empty queue is the one that currently contains zero elements. Very similarly, the full method returns True if the current size of the queue is equal to the maximum size of the queue.
def full(self): return self.size == self.max
So now let's look at a couple of routines that are slightly trickier.
def enqueue(self, x):
    if self.size == self.max:
        return False
    self.data[self.tail] = x
    self.size += 1
    self.tail += 1
    if self.tail == self.max:
        self.tail = 0
    return True
The enqueue method takes an argument x; here, x is an integer that we want to try to add to the queue. The first thing this method does is check whether the current size of the queue equals the maximum size. If so, the queue is full, and we return False.
if self.size == self.max: return False
If we passed this test, the queue is not full and we have room. So the next thing we're going to do is put the argument data item into the queue at the location pointed to by the tail.
self.data[self.tail] = x
Now let me show you a little bit about how our representation works. For demonstration purposes, we're going to look at a 3-element queue. Initially, its head and tail both point at the first queue element — that is, the queue element with index zero — and its size is zero.
To enqueue an item, the first check is whether the queue is full. No, it's not, because its size is zero. So we go ahead and put the item — let's say it's the number 7 — in the queue element pointed to by the tail, and then we increment the tail. The last thing we have to do to enqueue an element is increase the size of the queue to 1.
Okay, now let's go look back at the code.
self.data[self.tail] = x
self.size += 1
self.tail += 1
if self.tail == self.max:
    self.tail = 0
Looking at the code, we can see that we put the element in the queue, increase the size of the queue, and move the tail to point to the next element. The only bit of logic that's at all tricky here is this: if the tail of the queue points past the end of the queue — that is, if it's equal to max, and remember that in a zero-indexed array, index max is one past the end of the queue — then we reset the tail to point at element zero of the queue, that is, at the beginning.
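Here's a short, self-contained trace of that wrap-around logic for a 3-element queue (Python 3 syntax; this just re-enacts the enqueue steps outside the class for illustration):

```python
# Re-enacting the enqueue steps for a 3-element queue to show the tail
# wrapping around after the third enqueue.
import array

size_max = 3
data = array.array('i', range(size_max))
tail = 0
size = 0

for x in (7, 8, 9):
    data[tail] = x
    size += 1
    tail += 1
    if tail == size_max:  # index size_max is one past the end
        tail = 0          # so wrap around to the beginning

print("tail after three enqueues:", tail)  # wrapped back to 0
print("size:", size)                       # 3: the queue is now full
```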
Now the dequeue method is very similar
def dequeue(self):
    if self.size == 0:
        return None
    x = self.data[self.head]
    self.size -= 1
    self.head += 1
    if self.head == self.max:
        self.head = 0
    return x
First, if the size of the queue is zero, then the queue is empty, and we're not going to be able to dequeue an item. What we do in this case is return Python's None. None is a special value in Python that we can often use to indicate that we don't have anything — that we don't have any actual value.
If we pass that test, then there is something to return. So we store the item from the head of the queue in a temporary variable: x is going to get 7. We decrement the size of the queue, move the head pointer to point to the next element, and then, using logic very similar to that for the tail pointer in the enqueue function, we wrap the head pointer around if it's gone past the end of the queue.
So let's go back to the drawing and see how this plays out.
So we're going to return 7, decrement the size, and make the head pointer point to the next element of the queue. We're not going to bother erasing the 7 we returned, but we do have to make sure that our queue logic never returns this dead element.
So let's take a very quick quiz just to make sure that you understood all of that.
Reference Code for Question:
class Queue:
    def __init__(self, size_max):
        assert size_max > 0
        self.max = size_max
        self.head = 0
        self.tail = 0
        self.size = 0
        self.data = array.array('i', range(size_max))

    def empty(self):
        return self.size == 0

    def full(self):
        return self.size == self.max

    def enqueue(self, x):
        if self.size == self.max:
            return False
        self.data[self.tail] = x
        self.size += 1
        self.tail += 1
        if self.tail == self.max:
            self.tail = 0
        return True

    def dequeue(self):
        if self.size == 0:
            return None
        x = self.data[self.head]
        self.size -= 1
        self.head += 1
        if self.head == self.max:
            self.head = 0
        return x
Q. First, I'm going to make a 2-element queue. I'm going to set a temporary variable r1 to the return value of enqueuing the element 6, set r2 to the return value of enqueuing another element, 7, and r3 to that of enqueuing a third element, 8. Then I set r4 to the value returned by dequeuing an element, r5 to the value returned by dequeuing another element, and finally r6 to the value returned by dequeuing yet another element.
What are the values of these six variables?
a) 6, 7, 8, 6, 7, 8.
b) True, True, True, 6, 7, None.
c) True, True, False, 6, 7, 8.
d) True, True, False, 6, 7, None.
S. The answer of the quiz is D.
The first enqueue operation returns True because in a 2-element queue into which we haven't yet inserted any elements, there's space to insert an element. The second enqueue call succeeds for the same reason. The third enqueue call fails and returns False, because there was no room left in the 2-element queue after two enqueue operations had already succeeded.
Because we have a FIFO queue, the first dequeue operation returns the first element that we'd added to the queue--that is to say 6, the second dequeue operation returns 7, and the third dequeue operation, which is now called on an empty queue, returns None.
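If you want to convince yourself, the whole quiz sequence can be replayed against the queue code from the lecture (reproduced here so the snippet is self-contained, in Python 3 syntax):

```python
# The lecture's fixed-size queue, followed by the quiz sequence.
import array

class Queue:
    def __init__(self, size_max):
        assert size_max > 0
        self.max = size_max
        self.head = 0
        self.tail = 0
        self.size = 0
        self.data = array.array('i', range(size_max))

    def empty(self):
        return self.size == 0

    def full(self):
        return self.size == self.max

    def enqueue(self, x):
        if self.size == self.max:
            return False
        self.data[self.tail] = x
        self.size += 1
        self.tail += 1
        if self.tail == self.max:
            self.tail = 0
        return True

    def dequeue(self):
        if self.size == 0:
            return None
        x = self.data[self.head]
        self.size -= 1
        self.head += 1
        if self.head == self.max:
            self.head = 0
        return x

q = Queue(2)
r1 = q.enqueue(6)
r2 = q.enqueue(7)
r3 = q.enqueue(8)   # queue is full: returns False
r4 = q.dequeue()
r5 = q.dequeue()
r6 = q.dequeue()    # queue is empty: returns None
print(r1, r2, r3, r4, r5, r6)  # True True False 6 7 None
```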
Okay, so now that we know that while sometimes we can be clever, a lot of the time we just can't test everything we would like. How useful is any individual test case? Is there any way to make a test case that is a little bit more useful in practice?
Let's look at this real quick: we enqueue 7 into our fixed-size queue and then immediately dequeue it. We should get 7 back out; if that's the case, we print success, and otherwise we have an error.
How useful is this? Let's go ahead and turn this into a quiz.
Q. The question is: if we pass this test case, what else have we learned about our code, or have we learned anything? Your options are that we've learned nothing at all; that we've learned our code passes this test case; that our code passes any test case where we replace 7 with a different integer; and that our code passes many test cases where we replace 7 with a different integer.
S. Okay, so if we pass this test case, then obviously, yes, we've learned at least that our code passes this test case. It isn't completely useless.
Does our code pass any test case where we replace 7 with a different integer? Well, maybe not: if you replace 7 with a gigantic integer, something so big you can't even store it on your computer, then you're probably not going to get the same behavior out of this test case as with something simple like 7. So we can't necessarily say this one.
Now, does our code pass many test cases where we replace 7 with a different integer? This is true. If we were to replace 7 with, say, 20 (replacing it in the check at the bottom as well), then this should behave essentially the same, and it should behave essentially the same for many, many integers. Certainly not for integers with very large storage requirements, though, and not for integers outside the range of the queue's static type: for example, if the type stored in your queue can only hold very small integers — say, shorts, which are typically much smaller — then this won't work.
So the question is--what exactly do we learn, and how do we make test cases that represent a large number of other potential test cases so we don't have to write a whole lot of very small simple test cases?
The question I'm asking here is given this test case that we were talking about,
is it possible that I can conclude without trying it that this test case will succeed?
What this test case does is enqueue 7, sleep for a while, dequeue an element, and then check whether it was 7. The answer again is yes, and again, it depends on making an argument about the actual form of our queue code. If you recall that code, it didn't contain any dependencies on the time at which queue requests are made.
Just to be clear, we're trying to make arguments that a single test case is representative of a whole class of actual executions of the system that we're testing. And in the general case, we can't make those arguments. We can only make these arguments for the specific queue code that we're talking about because we know that, for example, it doesn't depend on the integer value that we're adding. It doesn't depend on the time at which operations are made and that's why these arguments actually work.
What we're actually getting at here is a fairly deep idea. The idea is we have a single test case that represents a point in the input space for the system that we're testing. By running the code, we mapped that point in the input space to a single point in the output space. But the problem is, there is an extremely large input space and we can't test it all.
What we're trying to do here is build up an intuition for when we can make an argument not just about the mapping of a single input to a single output, but rather about some class of inputs that are closely related — inputs that are somehow equivalent for the purposes of the system under test — such that if the system executes correctly for the single input we've tried, it's going to execute correctly for all of the inputs in that particular bit of the input space.
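One way to act on this idea is to pick a representative value from each interesting region of the input space instead of trying every input. For our statically typed 'i' array, the class boundaries of a C int are natural candidates. This sketch is my illustration, and the exact width of 'i' is platform-dependent, which the code accounts for:

```python
# Picking representatives from equivalence classes of the input space
# instead of testing every integer. For array.array('i') (a C int,
# typically 32 bits), interesting classes include zero, small values,
# and the signed 32-bit boundaries.
import array

representatives = [0, 1, -1, 7, -7, 2**31 - 1, -2**31]

buf = array.array('i', [0])
for x in representatives:
    buf[0] = x           # store into statically typed memory
    assert buf[0] == x   # round-trips unchanged for in-range values

# A value just outside the class boundary fails in a different way:
try:
    buf[0] = 2**31       # one past the largest 32-bit signed int
    print("stored without error: this platform has wider C ints")
except OverflowError:
    print("OverflowError, as expected for 32-bit int storage")
```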
Let's look at a few ways in which the arguments we've made here could have gone wrong. Could it ever be the case that our queue works for some integers but not for integers in general? Certainly it could. For example, if the Python runtime contained some sort of bug where large integers were truncated, then it would obviously be the case that if we put one of those large integers into our queue, it would get truncated somewhere in the queue processing, and what came out wouldn't be what we tried to put in.
Similarly, if the Python runtime contained some sort of garbage collection bug that corrupted heap memory, it might be the case that sleeping would give the Python interpreter time to corrupt our value or our queue, and we wouldn't get the right thing out. But in both of these cases, we're making arguments about the broader runtime system, not the actual code under test. Remember that the arguments we made here were really only about the actual code we were testing.
So in summary, we can make these kind of arguments but we have to be extremely careful that the assumptions that those arguments are based on are good ones. What I'd like to do now is give you a little quiz about these concepts.
Here's the quiz I'd like to give you. What I've provided is the queue code we've already seen, and also three test functions. The first test function is the one that I wrote, and I'll go through it really quickly.
def test1():
    q = Queue(3)
    res = q.empty()
    if not res:
        print "test1 NOT OK"
        return
    res = q.enqueue(10)
    if not res:
        print "test1 NOT OK"
        return
    res = q.enqueue(11)
    if not res:
        print "test1 NOT OK"
        return
    x = q.dequeue()
    if x != 10:
        print "test1 NOT OK"
        return
    x = q.dequeue()
    if x != 11:
        print "test1 NOT OK"
        return
    res = q.empty()
    if not res:
        print "test1 NOT OK"
        return
    print "test1 OK"
It creates a three-element queue and checks that it's empty — that is to say, it calls the empty method, and if the empty method doesn't return True, we print "test1 NOT OK" and return. Otherwise, we keep going. We enqueue a 10 and make sure that succeeds, enqueue the value 11 and make sure that succeeds, call the dequeue method — which takes out the first thing we put into the queue, which is 10 — and make sure it's 10. We dequeue the next element, make sure it's 11, and then check again that the queue is empty. If all of these checks pass, we print test1 OK.
So, that's my test function. Your test functions are test2 and test3
def test2():
    print "test2 OK"

def test3():
    print "test3 OK"
--so just to make sure that everything works here, we run the code, and we see test1 OK, test2 OK, and test3 OK.
Q. Your assignment is to write functions test2 and test3. Previously, I gave some examples of several different test cases for the queue that were equivalent to each other, in the sense that running one of them is just about as good as running the other. For example, we can enqueue some numbers, dequeue them, and check that they're correct; the queue code would behave exactly the same if we did the same operations but with sleeps in between them.
So, what you're going to do here is the opposite of what I was showing. I want you to write test functions test2 and test3 so that they're not testing equivalent functionality in the queue. This should be pretty easy, but let's go up and look at the code to get some ideas about how that might work.
So, what kinds of behaviors in the code could you trigger that I didn't? Well, it's pretty obvious. For example, I never triggered the full case for enqueue, I never triggered the wrap-around case for the tail pointer, I never triggered the empty case for dequeue, nor did I trigger the wrap-around case for the head pointer. Given those ideas, write functions test2 and test3 such that they are not testing queue functionality equivalent to the test1 function that I wrote.
S. We have the original test function test1; remember that we said the behaviors it doesn't test for are all of these if statements.
def enqueue(self, x):
    if self.size == self.max:
        ....
    if self.tail == self.max:
        ....

def dequeue(self):
    if self.size == 0:
        ....
    if self.head == self.max:
        ....
Let's just have test2 test the two main ones for enqueue: if the size is equal to max, then we return False, and if the tail is equal to max, then we reset the tail to zero. So let's go ahead and build that.
def test2():
    q = Queue(2)
    res = q.empty()
    if not res:
        print "test2 NOT OK"
        return
    res = q.enqueue(1)
    if not res:
        print "test2 NOT OK"
        return
    res = q.enqueue(2)
    if not res:
        print "test2 NOT OK"
        return
    res = q.enqueue(3)
    if res:
        print "test2 NOT OK"
        return
    if q.tail != 0:
        print "test2 NOT OK"
        return
    print "test2 OK"
Okay. So in test2, we create a queue of size 2, and we make sure that it's empty to begin with, just as a good start. Then we enqueue 1 and enqueue 2 and make sure that both of those return True. Then we attempt to enqueue 3, which shouldn't work, because remember, it's only a queue of size 2: it can only store two elements. We also check that the tail, q.tail, is equal to zero. The second enqueue made the tail reach max, so the code should have reset it to zero, like it says right up in enqueue where it says:
if self.tail == self.max: self.tail = 0
If the tail hadn't been reset, that would be a problem, and we check for it with
if q.tail != 0
Otherwise, we print test2 OK. So we go ahead and run this, and we see that test2 prints OK, so everything is good.
So now, let's go ahead and look at test3. For the third test, let's take a look at the dequeue function.
def dequeue(self):
    if self.size == 0:
        return None
    x = self.data[self.head]
    self.size -= 1
    self.head += 1
    if self.head == self.max:
        self.head = 0
    return x
We want to test that when you dequeue from an empty queue, you get None back, since there is nothing to return. We also want to check the wrap-around: if, after a dequeue, the head pointer has stepped past the last element of the queue, then we want it to be reset to zero. If we go past the end, we want to wrap around.
Let's take a look at test3.
def test3():
    q = Queue(1)
    res = q.empty()
    if not res:
        print "test3 NOT OK"
        return
    x = q.dequeue()
    if x is not None:
        print "test3 NOT OK"
        return
    res = q.enqueue(1)
    if not res:
        print "test3 NOT OK"
        return
    x = q.dequeue()
    if x != 1 or q.head != 0:
        print "test3 NOT OK"
        return
    print "test3 OK"
The first thing we do is initialize a queue of one element — it can only store one element, so it's not a terribly exciting queue. First, we check that it's empty, which it should be; if empty() doesn't return True, we've got a problem. After that, we try to dequeue; remember, this should return None, and if it doesn't, again we've got a problem. If it does, we continue and enqueue one element, so at this point, the queue is full. Now, when we dequeue, the head pointer gets incremented to one, which is the max size of the queue — one past the end — so the code should reset q.head to zero. If it doesn't, we've got a problem.
We also check that the element we get back from dequeue is 1, just to be complete. So if we get through all of these checks, we print test3 OK. We go ahead and run that, and we see that, yes, test3 is OK.
So, you might have tested different things. You also might have tested just some of these and that's fine. The goal was just to try to figure out different places in the code that you could test against that we hadn't already given you. And that's a really important thing to learn how to do as a tester.
Now we're going to talk about something that's pretty important in practice, and that's creating software that can be tested effectively. We're going to start off with things that are kind of, maybe a little bit fuzzy and unquantifiable. And we'll get towards more very specific recommendations a little bit later.
I'm going to start off with some generic recommendations. First, write clean code — and of course, we all know that's pretty much impossible to do all of the time. So second, when it's not possible to write clean code in the first place, refactor the code whenever necessary. Third, for any software module that we're writing, we should always be able to clearly describe what the module does and exactly how it interacts with other code. If we find ourselves in a situation where we can't clearly describe all of the dependencies that our module has on other modules in the system, then we've probably created software that can't be tested effectively.
Code should never have any more threads than it absolutely needs to get its work done. The software should not be a swamp of global variables. Global variables make software really hard to test, because global variables are implicitly inputs to any software that reads the global variables. And so, in addition to testing software on its explicit inputs, if we have a lot of global variables, we're going to be forced to test a lot of values of those, and this can really make testing difficult.
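Here's a tiny sketch of why globals are implicit inputs; the function names and the discount example are made up for illustration (Python 3 syntax):

```python
# Why global variables are implicit inputs: the same explicit argument
# can yield different outputs depending on hidden global state.
discount = 0.0  # a global variable

def price_with_global(base):
    return base * (1.0 - discount)        # reads hidden global state

def price_explicit(base, discount_rate):
    return base * (1.0 - discount_rate)   # every input is explicit

a = price_with_global(100.0)  # discount is 0.0 here
discount = 0.5
b = price_with_global(100.0)  # same argument, different answer
print(a, b)  # 100.0 50.0

# The explicit version can be tested from its arguments alone:
assert price_explicit(100.0, 0.0) == 100.0
assert price_explicit(100.0, 0.5) == 50.0
```

To test price_with_global thoroughly, we'd have to cover values of discount as well as values of base; the explicit version keeps the whole input space visible in the signature.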
Our code — and this applies probably more to C, C++, and Java than to Python — shouldn't be a soup of references and pointers, because these are another way to create unintentional connections between different parts of the code, breaking modularity and again making our code really hard to test. Most of the time, the modules that we create should have unit tests, and when applicable — that is to say, when our modules interface with other modules that might fail — we should add support for fault injection. Finally, testable software often contains a large number of assertions, and so now we're going to expand on that topic.
Assertions are an important enough topic that we're going to want to cover them separately. An assertion is an executable check for a property that must be true of your code.
For the square root example, we're going to define a Python function called sqrt, which takes an argument. There's going to be some code at the top that computes the result, and we're going to assume that if the argument to the square root was negative, that code bailed out with some sort of exception. So if we reach this line in the code,
assert result >= 0
then we successfully computed a square root, and what we're going to do is assert that the result is greater than or equal to 0 and then return it. Why do we assert that the square root is always greater than or equal to 0? Because we know, by definition, our square root function returns the non-negative square root of its argument. At this point in the code, we know that we must have computed a non-negative number, so it's good to go ahead and assert that.
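Putting those pieces together, a hedged sketch of the sqrt example might look like this. The body uses math.sqrt purely for illustration; it is not the course's actual implementation (Python 3 syntax):

```python
# Error handling for bad input (an exception), plus an assertion on a
# property our own logic guarantees.
import math

def sqrt(arg):
    if arg < 0:
        # Error handling, not an assertion: bad input is the caller's
        # mistake, not a bug in our logic.
        raise ValueError("sqrt of negative number")
    result = math.sqrt(arg)
    # If our logic is right, this can never fire.
    assert result >= 0
    return result

print(sqrt(16.0))  # 4.0
```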
Let's look at some basic guidelines for putting assertions into code.
Rule 1 is that assertions are not for error handling. For example, it would probably have been a mistake, at the beginning of this routine, to assert that arg was greater than or equal to 0. That reflects a condition we want to handle with an exception — that is to say, with Python's established error-handling mechanism — and not with an assertion. The assertion we put into the code checks the result of the logic we wrote ourselves; it isn't asserting something about the behavior of some other entity in the program. That would generally be more in the domain of error checking.
Rule 2--and this is really important--is never make assertion calls that have side effects. An example of a side-effecting assertion would be asserting that a function foo returns the value 0, where foo also changes the value of a global variable. The problem with side effects in assertions is that when we turn on optimization in Python, it's going to drop all of the assertions in our program. So what we're going to have is a program which happens to work correctly, because the assertion changes a value in some way that's needed. But when we turn on optimizations and drop all of the asserts, the program is going to stop working. This makes the assertions worse than useless. Having an assertion in your code that changes the global state in some observable fashion is almost the worst possible thing you can do with an assertion, so we definitely don't want to do this.
Rule 3 is don't put silly assertions into our code. To give an example, asserting that 1 + 1 == 2: is it conceivable that in some Python program 1 + 1 != 2? Sure, it's conceivable, if the Python interpreter is incredibly broken. But if the Python interpreter is that broken, nothing is going to work--our program isn't going to run at all! So there's no point whatsoever in a silly assertion like this.
The best assertions are those which check a nontrivial property that could be wrong but only if we actually made a mistake in our logic. It's not something that could be wrong if the user did something wrong, and it's not something that's wrong that's just completely silly to check. Now let's write some assertions.
Let's look at the queue code we wrote previously and try to figure out what kind of assertions we could add to it to make it more robust with respect to mistakes.
import array

class Queue:
    def __init__(self, size_max):
        assert size_max > 0
        self.max = size_max
        self.head = 0
        self.tail = 0
        self.size = 0
        self.data = array.array('i', range(size_max))

    def empty(self):
        return self.size == 0

    def full(self):
        return self.size == self.max

    def enqueue(self, x):
        if self.size == self.max:
            return False
        self.data[self.tail] = x
        self.size += 1
        self.tail += 1
        if self.tail == self.max:
            self.tail = 0
        return True

    def dequeue(self):
        if self.size == 0:
            return None
        x = self.data[self.head]
        self.size -= 1
        self.head += 1
        if self.head == self.max:
            self.head = 0
        return x

    def checkRep(self):
        return
We go through the code and what you can see is at the end, I've added a checkRep function. And checkRep is a function which stands for check representation, and is a function that we commonly add to a data structure or to other functions that checks that the variables in the program are self-consistent. What it's going to do is basically try and terminate the program if some invariant which we know should hold over the program's data structures fails to hold.
What I first want to do, though, is run this program. And when I run it, what's going to happen is some testing code that I haven't shown you is going to create a queue and call its methods in a loop, and it's going to look at the output of the queue and try to find any mistakes. The test function has the test oracle built into it that I wrote that's going to try and see if our queue is correct. What we hope is that it's correct. Also, we know that the checkRep is going to succeed, because it contains no assertions right now, so let's run it.
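The testing code isn't shown in the notes. A minimal sketch of what such a harness might look like--using a plain Python list as the built-in oracle, with the circular queue repeated inline so the sketch runs standalone--could be:

```python
import array
import random

class Queue:
    # Inline copy of the circular queue from the notes, so this
    # sketch is self-contained.
    def __init__(self, size_max):
        assert size_max > 0
        self.max = size_max
        self.head = 0
        self.tail = 0
        self.size = 0
        self.data = array.array('i', range(size_max))

    def enqueue(self, x):
        if self.size == self.max:
            return False
        self.data[self.tail] = x
        self.size += 1
        self.tail += 1
        if self.tail == self.max:
            self.tail = 0
        return True

    def dequeue(self):
        if self.size == 0:
            return None
        x = self.data[self.head]
        self.size -= 1
        self.head += 1
        if self.head == self.max:
            self.head = 0
        return x

    def checkRep(self):
        return

def test():
    # The list trivially implements a correct FIFO queue; we compare
    # the real queue against it after every random operation.
    q = Queue(5)
    oracle = []
    random.seed(1)
    for _ in range(1000):
        if random.random() < 0.5:
            v = random.randint(0, 10000)
            if q.enqueue(v):
                oracle.append(v)
        else:
            z = q.dequeue()
            res = oracle.pop(0) if oracle else None
            assert z == res
        q.checkRep()
    print("finished")
```

This is only a guess at the hidden harness's shape, but it has the pieces described: random method calls, a built-in oracle, and a checkRep call after every operation.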
Once our queue passes the test harness, let's see if we can harden up the queue a little bit by thinking of a good assertion to add to it. So I think that the first assertion that we're going to add is going to be over the size variable. If you recall from earlier, the size variable tracks the number of elements currently in the queue. So let's go ahead and add an assertion about the size.
One thing we know about the size variable is it should never be negative, because we can't possibly ever have a negative number of elements in a queue, so let's assert that self.size is greater than or equal to zero. And the other thing we can assert about self.size is that it never exceeds self.max, which is the constant variable, which counts the most elements that could ever be in our queue.
    def checkRep(self):
        assert self.size >= 0 and self.size <= self.max
        return
So here we're asserting that self.size is not negative and doesn't exceed self.max. Now when I run the test harness again, we're going to check the queue for functional correctness and also call the checkRep. Let's see what happens: it printed "finished" again, and so that's good.
Okay. That invariant was good. What I want to do next is break our queue, by making enqueue fail to properly reset the tail variable when it wraps around. In the enqueue logic, when self.tail is equal to self.max, we reset it back to 0; I'm going to set it to 1 instead. So let's see what happens when we run the queue now. It failed, and it failed in an assertion inside the test harness--in my test function down here on line 90.
Traceback (most recent call last):
  File "vm_main.py", line 26, in <module>
    import main
  File "/tmp/vmuser_tmmfgkajxz/main.py", line 90, in <module>
    test()
  File "/tmp/vmuser_tmmfgkajxz/main.py", line 80, in test
    assert z == res
AssertionError
The queue code itself is only about 40 lines, so my test function is well below that, and that's where we failed. What this means is we've messed with our data structure in such a way that it actually returns a wrong result--this is the assertion that the result equals the expected result. So we've created a data structure that's broken, and in real life this data structure would be deeply buried in some program that we care about, returning wrong results to some deeply buried part of the program, and probably that program is going to keep running for a while and start to do very bad things--losing track of web pages, or doing other things that we don't like. So what we want is to write a tighter set of assertions for our queue, so that our checkRep can catch this before we actually return the wrong thing to the user.
Q. So this is your quiz: write an assertion that goes under mine in checkRep and catches the bug in the queue before it can actually misbehave. The trick is that since our buggy code sets self.tail to 1, that value is in range. Besides saying that self.size is between 0 and self.max inclusive, we could have asserted the same bounds for self.head and self.tail, but the bug here is not going to violate those particular checks. Rather, what it's going to violate is a relative check between the values of head, tail, and size. That's what you should be thinking about adding to your assertion here. So what I want you to do is come up with the correct data structure invariant to assert, so that the assertion fails when we run our queue tester, before we actually return a wrong result.
Additionally what we're going to do is we're going to check your assertion against a correct version of the queue, so we're going to take your checkRep code and put it in the correct version of the queue and check that it doesn't fail spuriously. So what we want to do is write an assertion that's tight enough that it catches the error that we've introduced here but not so tight that it always fails. Okay. I hope you'll have fun with this.
S. I'm not sure what assertions you added, but let's look at some of the ones that I decided to add. One check we can make is that if self.tail is greater than self.head, then the difference between them has to be the size. That is to say, the distance between the head and tail has to be the number of elements currently in the queue. On the other hand, if the tail is less than the head, then the tail has already wrapped around, and in this case the difference between head and tail has to be equal to max minus the size of the queue. The third case is that if the head and tail are at the same place, then either the queue is empty or the queue is full.
    def checkRep(self):
        assert self.size >= 0 and self.size <= self.max
        if self.tail > self.head:
            assert (self.tail - self.head) == self.size
        if self.tail < self.head:
            assert (self.head - self.tail) == (self.max - self.size)
        if self.tail == self.head:
            assert (self.size == 0) or (self.size == self.max)
        return
Let's go ahead and run this on the buggy code and see what happens. So, we violated an assertion, and better yet, we violated an assertion in the checkRep function. That's good: by violating an assertion in checkRep, we caught the bug, we're terminating the program, and we can debug the code using localized debugging inside this data structure. We don't have to do some huge tracking down of where the bad values came from. We want to do one final check here to make sure that my assertions are right: I'll go to the buggy queue code, fix it, and run the code again. What we want is for all of these assertions to succeed during the test cases. Okay, good. That was the case.
Now we're going to talk about the question: why assertions? What are assertions really buying us? This boils down to three or four basic reasons.
Something you might be wondering is do real developers put assertions into the code that they write? It turns out that often they do.
I took a quick look at the GCC and the LLVM code bases. GCC is the GNU compiler collection. It contains not only the well-known C compiler and C++ compiler, but also a Java compiler, and an Ada compiler and a number of other tools, and the GCC source code base contains more than 9000 assertions.
Now, the LLVM compiler suite contains about 13,000 assertions over about 1.4 million lines of code, for a total of about 1 assertion per 110 lines of code. Here I'm just counting raw lines of code, not source lines of code; this includes blanks, comments, and everything.
What this tells us is that the LLVM and GCC developers have made a pretty serious commitment to checking assumptions, preconditions, postconditions, and invariants in the code that they wrote. One thing that I've done personally is reported a lot of compiler bugs to both the GCC and LLVM projects, and one thing I've learned is that much of the time these bugs show up as assertion violations. Not always--sometimes these compilers segfault or have other problems--but most of the time they're assertion violations. These assertions are actually succeeding at accomplishing the goals that we talked about.
One further thing that I want to talk about is: do we disable assertions in production code? For example, when we run the Python interpreter with the -O option, which causes Python to enable some optimizations, it disables assertions in order to make our code run faster. The question we're asking is: is that a good thing? It turns out that there are arguments we can make in both directions.
Let's first look at reasons why we might want to disable assertions. One of the main advantages of disabling assertions would be that it lets our code run faster. Another advantage--and here we're starting to get into something, which really depends on what we're trying to do with our system-- is that code running without assertions is less likely to stop running. The code keeps going even after some sort of condition is found within the code that would have triggered an assertion violation if assertions had been enabled.
Here we really have to ask ourselves the question what is it that we're trying to do with our system? Is it better to keep going or is it better to stop? Remember that keeping going after some condition is true that will lead to an assertion violation may lead to a completely erroneous execution. On the other hand, possibly, that's better than actually stopping.
Now let's move to disadvantages of disabling assertions in production code. If our code happens to rely on some side effect performed by an assertion-- that is to say, if whoever wrote the code that we're using has violated one of the rules for assertions that I gave you-- then turning off assertions is going to break the code. Not only is it going to break the code, but it's going to break the code in an extremely confusing way, because lacking assertions, we're going to have a very hard time detecting the error except that it's probably going to cause our system to crash in some confusing way or just to do completely the wrong thing. This is a real risk of turning off assertions in large systems.
The second reason is even in production code it may be better to fail early rather than keep going. Really, it just depends on what our system is doing. The question is do our users want the system to die or do they want the system to give them some completely wrong result?
What I'm going to do now is read a short quote from Julian Seward, the author of the Valgrind tool. If you're a serious C or C++ developer and you haven't used Valgrind, you really should. It's an amazing tool that runs dynamic checks on your running program, looking for errors such as invalid memory accesses, out-of-bounds array accesses, and other things. He said,
"This code is absolutely loaded with assertions, and these are permanently enabled. As Valgrind has become more widely used, they have shown their worth, pulling up various bugs which would otherwise have appeared as hard-to-find segmentation faults. I am of the view that it's acceptable to spend 5% of the total running time doing assertion checks."
So, what Julian is saying here is that he's going to cost everybody 5% of total running time in order to make testing and debugging of Valgrind easier. I've included this somewhat long quote because I really agree with Julian here. I think it's generally worth our time to try to make our code fail early, fail fast, rather than keeping going and failing in confusing ways later or even providing people with completely wrong results.
On the other hand, let's imagine we're working at NASA or someplace like that, and we're writing software that's going to be controlling a spaceship as it lands on Mars or some other planet.
We have to ask ourselves the question do we want to enable assertions checks for this kind of a mission? I actually talked to a NASA engineer some time ago who had worked on the software for one of the Mars missions. What he told me was for most of the mission-- that is to say for the launch phase, for the cruise phase, for orbiting Mars-- they had plenty of assertions enabled in the actual mission software-- that is to say, in the software running on the spacecraft. On the other hand, for a short period of time during which the spacecraft was actually landing they disabled all of the assertions on the idea that they got only one shot to land on the planet and that if anything went wrong with the software during the landing process it was better to just try to keep going, because an assertion violation leading to a system reset could cause the lander to become unstable in such a way that it wouldn't have been recoverable.
That gives us a summary: if you're doing something so critical that keeping going matters more than anything else--something that resembles landing a spacecraft on Mars--then go ahead and turn off assertions in your production software. Otherwise, you're probably better off leaving them enabled.
So there's always some software under test. And this might be something huge like the Linux kernel, it might be something really small like a new square root function that we wrote, but, in any case, what we want to be able to do is draw a box around the software that we're testing.
In general, the reason that we implemented the software is to provide some sort of a service that we're going to use in other software that we're going to provide to a user. And what that means is the software under test is providing some set of APIs-- that just stands for application programming interfaces-- and all this means is the software under test is going to provide a set of function calls that can be called by other software. And of course, that's what we want to be testing.
So for the next little while, let's assume a really simple piece of software under test. Let's assume that it's just a square root routine that we have implemented, and we want to make sure that it works. So to test this code, all we're going to need to do is pass values into the software--
so here we're calling the square root routine with 9 as an input, and seeing what it returns--seeing if it returns the thing that we expected. Let's take a quick quiz here. Suppose I call a square root routine--and let's assume that we're not talking about Python's built-in square root; we're just talking about a general square root function that we're implementing for some hypothetical piece of software.
Q. So we invoke the square root routine with 9. And what should it return? You should answer with all of the answers that apply.
S. The answer to this quiz is any answer is acceptable. It's perfectly possible that the answer we expected from square root of 9 is 3. In fact, that's what we would expect most of the time. But also, -3 is a perfectly correct answer for the square root of 9. The issue that we're getting at here is what's the specification for the software under test, for our square root routine? Is it defined to return the positive value? the negative value? Can it return either of them?
What we're getting at here is that software always operates under some sort of a specification. Of course, most of the time the specification isn't actually available to us in some sort of a clear technical document. Often the specification is vague, or it's written down in English. Part of our job as testers is to help refine the specification by figuring these things out--by figuring out what the acceptable answers really are.
Let's continue with our example. We're testing the square root routine, and now we're going to test it with the input value of -1. So we call square root of -1, and the question is, what should it return? Let's make this into another little quiz.
Q. Our possible answers are
d) throw some sort of an exception;
f) just crash the machine.
I'd like you to answer with all that apply as reasonable results for square root of -1.
S. Let's go through the answers.
So the answer is some combination of b, c, and d.
Again, what's happening here is that just running a simple test case is forcing us to think about the specification for the software under test. In fact, it's really common that as soon as we start testing a piece of software, we have to start thinking hard about what the software is actually supposed to be doing. And this is a good thing. Often when we're testing software, we're not just looking for bugs; we're helping to refine the specification for the software under test.
What I'd like to do now is look into this issue in a little more detail. If we think about a piece of software as a mathematical object--and of course that's very natural for something like a square root function, but we can do this for any piece of software--what we'll find is that the software has a domain: the set of possible input values that the software is designed to handle. Correspondingly, every piece of software also has a range: the set of possible outputs.
So let's look at what the domain and range are for the square root function. Earlier we said that we probably would like to return the positive answer for a square root, and we also said that we were probably not going to be able to compute the square root of a negative number. We can account for both of those facts in the square root's domain and range by making it a function from non-negative real-valued numbers to non-negative real-valued numbers.
Alternatively, if we're implementing some sort of mathematical software, we might want to define square root for all real-valued numbers, in which case the domain is the full set of real numbers, and then the range is the set of complex numbers.
Actual computer code is not going to be able to deal with the full set of real numbers; most real numbers are not representable on a computer at all. So instead of dealing with the reals, we're going to worry about floating-point numbers. For example, we might make square root a function from non-negative floating-point numbers to non-negative floating-point numbers. But this isn't very convenient: what should we do if somebody calls square root with a negative number? What languages like Python often do is declare square root over the full set of floating-point numbers, with a range that is the non-negative floating-point numbers unioned with a different behavior, namely throwing an exception.
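Python's standard library actually exposes both of the specifications discussed above: math.sqrt uses the floating-point domain with an exception for negative inputs, while cmath.sqrt maps its inputs into the complex numbers. A quick sketch contrasting them:

```python
import math
import cmath

def sqrt_results(x):
    # math.sqrt implements the float -> non-negative-float spec,
    # raising ValueError outside its domain; cmath.sqrt implements
    # the "map everything into the complex numbers" spec instead.
    try:
        real_result = math.sqrt(x)
    except ValueError:
        real_result = "math domain error"
    return real_result, cmath.sqrt(x)
```

For example, sqrt_results(9.0) gives (3.0, (3+0j)), while sqrt_results(-1.0) gives ('math domain error', 1j).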
And so this bottom domain and range for floating point is the situation we actually get in many real programming languages. Now given the specification, let's get back to software testing and ask the question, should we test this square root function with negative inputs? Let's bring the discussion back to testing and take a little quiz.
Q.The question is, given the specification here where square root maps the full set of floating-point numbers onto either a floating-point number or a math exception, should we test this code with negative inputs? And the answers are either
So take a minute now to think about your answer.
S. The answer is that yes, we should test it with negative inputs. Negative inputs are part of the domain of the function, and one of the basic principles of testing is that we should test computer programs with values sampled from their entire domain. Sometimes as a software tester, you'll test code with an input that looks like it should be part of the domain, and the code will malfunction--perhaps not by throwing an exception but by actually exiting abnormally. You'll go to the developer of the software under test and say, "I think I found a bug," and they'll tell you something like, "Well, the input that you tried is not part of the domain for the software." That's usually perfectly legitimate. As long as that kind of fact is documented, as long as it's clear to the users of the function, restrictions on the domain of functions are actually a very valuable tool in practice. Otherwise, without restrictions on the domain, every function that we implement, every piece of software that we implement, has to contain maximal defensive code against illegal inputs. In practice, this kind of defensive coding is not generally possible for all internal interfaces: for one thing, it makes code extremely slow, and for another, it clutters up code with error checks that make it impossible to find the code that's actually supposed to be doing something.
So to reiterate, domain restrictions are a perfectly valid technique for keeping code simple in the case where these kind of assumptions about input domain are actually reasonable.
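To make the whole-domain principle concrete, here is a small sketch (not from the course) that samples math.sqrt over a wide slice of its floating-point domain, negatives included, and checks that out-of-domain inputs fail cleanly rather than crash:

```python
import math
import random

def test_sqrt_over_domain(trials=1000):
    # Sample inputs from the whole (representable) domain, negatives
    # included, rather than only values we expect to succeed.
    random.seed(0)
    for _ in range(trials):
        x = random.uniform(-1e6, 1e6)
        if x < 0:
            # Out-of-domain inputs must fail cleanly, not crash.
            try:
                math.sqrt(x)
            except ValueError:
                continue
            raise AssertionError("negative input did not raise")
        r = math.sqrt(x)
        # Weak oracle: squaring the result recovers the input.
        assert abs(r * r - x) <= 1e-9 * max(abs(x), 1.0)
    return True
```

The acceptability check here is deliberately weak (round-tripping through squaring); designing stronger oracles is exactly the hard part the notes keep returning to.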
In contrast with software libraries and software components where the domain is clearly defined and might be a subset of all the values that you can construct using the programming language, it's often the case that higher level software artifacts don't have the luxury of limiting their domains. One good example of this is an operating system. Let's say we're testing a UNIX platform such as Linux or Mac OS X. I know we really haven't discussed this case yet, but I want to take a really quick quiz in order to start building up our intuition about testing.
This example will be a very slightly simplified version of the UNIX read system call, a system call supported by all UNIX platforms--Mac OS, Linux, and others. All the read system call does is take a file that's already open and read some bytes out of it into the address space of the process that calls read.
So the read call takes 3 parameters. It takes an integer called fd, which is just a small integer referencing a file that's already open. It takes a second parameter called buf, which is a pointer to a memory region--that is to say, a valid memory region in the process's address space. And finally, it takes a number of bytes to read.
Q. The quiz that we're going to take is which of the following test cases constitutes a useful test case for the read system call? Our 4 test cases are
a). Reading from file descriptor number 1--this is always a valid file descriptor for a UNIX process--into the address of b, which we'll assume refers to a valid memory region, and we're going to read 10 bytes.
b). Also reading 10 bytes into variable b, but specifying file descriptor -99999.
c). Reading from file descriptor 1, but using a pointer to an address which is almost certainly a bad one--this is just a random hex constant I made up--also reading 10 bytes.
d). Reading from file descriptor 1 into the valid address b, but reading -33333 bytes.
Select all of the test cases that you think are good ones for testing a UNIX kernel.
S. So the answer is all of the above. And the reason is, when we're testing something like an operating system kernel, we want to test it with basically all possible values of the parameters to the calls that it supports. And it turns out that there's a really good reason for this. The operating system boundary is a trust boundary. On one side of the operating system boundary we have us. We're the kernel implementers, and our job is to keep the machine running, to provide isolation between users, and basically to enforce all of the security policies that operating systems are designed to enforce. On the other side of the boundary we have them: our users. Our users might not actually be malicious; they might only be writing buggy code. But the point is, regardless of whether the users of the operating system are malicious or just writing buggy code, they're going to invoke system calls like read with crazy arguments, with all sorts of invalid arguments, and it had better not be the case that the operating system malfunctions when this happens.
Therefore, if we're testing the operating system interface, we really, really, really need to be issuing calls like read with garbage.
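In Python we can issue both the in-domain and garbage read calls from the quiz through os.read, a thin wrapper over read(2). Test case (c), the bad pointer, can't be expressed this way because os.read allocates its own buffer; the others map directly. A sketch:

```python
import os
import tempfile

def exercise_read():
    # The kernel must reject bad arguments cleanly (EBADF, EINVAL)
    # rather than crash the machine.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        os.lseek(fd, 0, os.SEEK_SET)
        assert os.read(fd, 10) == b"hello worl"  # an in-domain call

        for bad_fd in (-99999, 99999):           # garbage descriptors
            try:
                os.read(bad_fd, 10)
                return False
            except OSError:
                pass                             # rejected cleanly

        try:                                     # garbage byte count
            os.read(fd, -33333)
            return False
        except (OSError, ValueError):
            pass                                 # rejected cleanly
        return True
    finally:
        os.close(fd)
        os.remove(path)
```

The temp-file setup and the specific bad values are just illustrative; the point is that every garbage call produces a clean error rather than a kernel malfunction.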
And in fact, this is one of the ways that we actually test operating systems. So the Linux kernel developers-- or actually, this tool works for multiple versions of UNIX--have a tool called crashme.
Here we have the OS kernel, and here we have this program called crashme. What it does is allocate a block of memory, write totally random garbage into it, mask off all signal handlers--that is to say, system-level exception handlers--and then jump into the block of garbage--that is to say, start executing completely garbage bytes. When you start executing garbage bytes, often the instructions are illegal, and the operating system kernel has to handle this. So over time we end up exploring--that is to say, testing--a lot of calls into the operating system that contain really weird and unexpected values. And if the kernel keeps running properly and keeps killing off the crashme process, then the operating system kernel is succeeding.
If the kernel ever falls over, if it crashes, then we've discovered a bug. We have to go ahead and fix the operating system. And the experience is that if you take a real operating system and you test it with a tool like crashme, you'll end up crashing the kernel unless the operating system has already been hardened with a tool like this.
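A toy, crashme-flavored sketch (far weaker than the real tool, which executes random machine code) can still make the point: hammer a system-call wrapper with random arguments and check that the process survives every call:

```python
import os
import random

def fuzz_read_calls(trials=200):
    # Hammer the read(2) wrapper with random, almost certainly
    # invalid arguments; a clean OSError/ValueError on each call is
    # the kernel and runtime doing their job.
    random.seed(42)
    for _ in range(trials):
        # Stay away from fds 0-2 and other plausibly-open descriptors
        # so this sketch never blocks on stdin or touches real files.
        fd = random.choice([random.randint(-10**6, -1),
                            random.randint(10**5, 10**6)])
        nbytes = random.randint(-100, 100)
        try:
            os.read(fd, nbytes)
        except (OSError, ValueError):
            pass
    return True
```

If any of these calls took the process (or the machine) down instead of raising a clean error, we would have found a bug.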
What we're getting at here is something very fundamental about testing. We're working towards another testing principle: if you have an interface or an API that represents a trust boundary--that is, a boundary between the part of the system that we trust and users on the other side whom we can't trust to stay within the domain of the API we're implementing--then we have to test that API with all possible representative values, not just the ones that the developers happen to think are in the domain. It turns out that in practice, people are empirically not very good at testing these interfaces with a full range of values, and this lies at the core of a lot of the security vulnerabilities that we see in real software today.
So let's look at just one more example. We're going to be testing some sort of a graphical application. So we have a GUI application that's our software under test. And so we have to ask ourselves what's the domain and range of a graphical user application? The domain--that is to say, the set of legal inputs-- is basically just going to be the set of all possible GUI actions, so mouse clicks, keyboard events, swipes, etc. And the range is going to be possible states of the GUI application.
So now it should go without saying that the size of the domain here-- that is to say, the size of the set of all possible GUI inputs-- is really pretty gigantic. There are a lot of possible combinations of inputs.
Q. Now I have a quiz question for you. The question is what constitutes a set of good test inputs for a GUI application?
a). is just use the application normally.
b). is let a 4-year-old use the application for a while.
c). is use some sort of a tool to inject a stream of fake GUI events-- that is to say, a stream of automatically generated keyboard events, mouse clicks, and similar.
d). is reproduce GUI inputs--that is to say, mouse clicks and such-- that crashed our previous version of our software. Pick all of the answers that apply.
S. The answer is, of course, all of the above. All of these would be good ways to test a GUI application because a GUI interface, similar to an operating system interface, usually represents a trust boundary. And what I mean here is just the fact that it's usually the case that the GUI can't rely on a user to always present valid inputs. Therefore, a GUI application needs to be ready for basically anything. And it turns out that when my children were younger, one of them was really good at crashing Windows machines. You would just set him down at a Windows machine, and he could make it unusable within a couple of minutes of just pounding on the keyboard.
Let's talk about a final implication of the interaction between testing domains and trust relationships between different entities. Let's say we have some sort of an interface, and on both sides of it we have people we trust. Me, and some of my teammates.
And so the question we have to ask is can I trust my teammates, and can my teammates trust me to always generate inputs when using the various APIs that remain within the domain of those inputs? And of course the answer is generally no. In fact, I probably can't even trust myself to always generate inputs that are within the domain of acceptable inputs for APIs. What this brings us to is the idea of defensive coding-- that is to say, error checking--for its own sake to detect internal inconsistencies, and this is something that we'll get to a little bit later during this lesson.
So overall, testing software by calling into the APIs that it provides is fairly straightforward: we just make calls into the API and look at the results. But something inconvenient about real software is that software doesn't just provide APIs, it also uses them. What I mean is that the software under test is going to be issuing calls into libraries, into the operating system, and into virtual machines such as the Python runtime, and getting return values back. So for the next couple of minutes, let's take as our example the idea that the software under test is something like a web browser.
One thing we can do is test the web browser using the APIs that it provides-- that is to say, using its graphical user interface-- and not worry about testing it with respect to the APIs that it uses. And so what kind of APIs is the web browser using? For one thing, it's talking to the network, it's talking to the hard disk through the operating system, it's talking to all sorts of lower level interfaces. And sometimes those APIs don't act as we would expect. Just as a simple example, let's take the case where our web browser is storing cookies onto the hard drive of the local computer. Most of the time during testing, we expect the storage and retrieval of cookies to operate perfectly normally. But what happens if, for example, the hard disk is full when the web browser tries to store a cookie? Does the web browser crash? Does it mangle its internal state in some fashion and become impossible to use? Or does it gracefully stop storing cookies for that session and, for example, wait until there's more disk space free before it starts to store cookies again?
Of course we'd like our web browser to do the right thing, whatever that is, but we need to actually test this. One of the golden rules of testing is that we should never just hope that the software does the right thing; we need to actually check.
So the problem here is fairly awkward: we don't actually have direct control over how the operating system responds to the calls that we make. We can't easily just make storage of a cookie file fail. Rather, we would have to do something like create a full disk partition, arrange for our web browser to store cookies there, and then see how it responds. In this particular case, creating a full disk partition is awkward but doable; there are plenty of cases, though, where lower-level software has failure modes that we really can't easily simulate, regardless of how hard we work.
So let's now go back to our friend the UNIX read system call and take another quick look at it.
READ(2)                    Linux Programmer's Manual                   READ(2)

NAME
       read - read from a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t read(int fd, void *buf, size_t count);

DESCRIPTION
       read() attempts to read up to count bytes from file descriptor fd
       into the buffer starting at buf. If count is zero, read() returns
       zero and has no other results. If count is greater than SSIZE_MAX,
       the result is unspecified.

RETURN VALUE
       On success, the number of bytes read is returned (zero indicates
       end of file), and the file position is advanced by this number. It
       is not an error if this number is smaller than the number of bytes
       requested; this may happen for example because fewer bytes are
       actually available right now (maybe because we were close to
       end-of-file, or because we are reading from a pipe, or from a
       terminal), or because read() was interrupted by a signal. On error,
       -1 is returned, and errno is set appropriately. In this case it is
       left unspecified whether the file position (if any) changes.
And so this is how UNIX processes read from files, and of course real UNIX programs issue calls to read constantly, maybe hundreds of times per second. Earlier, we were concerned with the domain of the read system call--that is to say, the set of possible valid arguments to it:
ssize_t read(int fd, void *buf, size_t count);
-- and now we're concerned with the range because now we're not testing the UNIX operating system anymore; we're testing a program that runs on top of the UNIX operating system. And so it's the response of the operating system back to the process that concerns us here. We can see here that read returns the number of bytes read from a file,
On success, the number of bytes read is returned
but there's an interesting fact here: read is allowed to return fewer bytes than you actually asked for. So it's going to return some number between 0 and count, but we don't know which number. Another thing that read can do is just fail outright--that is to say, it can return -1 to the application:
On error, -1 is returned, and errno is set appropriately
-- but it turns out that there are a whole lot of different reasons for that kind of failure. There are at least 9 different error conditions that read can return: EAGAIN, EWOULDBLOCK, EBADF, EFAULT, and so on. The operating system can hand any of these values to the application that's calling read, and these values aren't all semantically equivalent; the application might have to do different things depending on which one it gets. It might be very hard for the people testing the web browser to actually make the operating system's read call return all of those different values, and until we've tested with all of them, we're left with software whose behavior we probably don't understand--and, therefore, software that hasn't been tested very well.
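As a concrete illustration of what a caller has to cope with, here's a minimal Python sketch of a helper that reads an exact number of bytes, tolerating short reads and a couple of transient error codes. This is an illustrative sketch, not production code: read_exactly is a made-up name, a real program would bound its retries and wait with select() instead of spinning, and modern Python already retries EINTR automatically, so that branch is belt-and-braces.

```python
import errno
import os

def read_exactly(fd, count):
    # Read exactly `count` bytes from file descriptor `fd`, coping with
    # the fact that os.read may return fewer bytes than requested or
    # fail transiently. Stops early at end of file.
    chunks = []
    remaining = count
    while remaining > 0:
        try:
            chunk = os.read(fd, remaining)
        except InterruptedError:
            continue                 # EINTR: interrupted by a signal; retry
        except OSError as e:
            if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                continue             # nothing available yet; retry
            raise                    # EBADF, EFAULT, etc. are real errors
        if not chunk:                # zero bytes read means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

Even this short sketch has to make a decision for every one of read's documented behaviors: short read, end of file, signal interruption, and each errno value.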
So I'd like to tell you that there's some really great solution to this problem, but there really isn't. And the fact is that a lot of real programs that run on operating systems aren't prepared for some of these odder, stranger responses that the operating systems can deliver, and, consequently, when those things happen, the software malfunctions.
So what we have is a fairly difficult testing problem, and in practice there are a couple of different ways to deal with it. The first way doesn't actually have anything to do with testing, but it's such good advice that I can't resist giving it: wherever possible, prefer programming languages and interfaces that are predictable and that return friendly error codes. What I mean is that, given a choice between using the UNIX system call interface, which has all of these unfriendly features that we were just talking about, and using the Python libraries, you'd almost always choose the Python libraries, because they have far friendlier behavior and far fewer obscure error codes.
Of course we don't always have the option of doing this. We probably can't implement a full-size web browser in Python and expect it to run really quickly and be fully featured, so we're sometimes forced to use these less friendly APIs. From a testing point of view, we can often use a technique called fault injection to deal with these kinds of problems.
Let's assume for the moment that we're using the Python library to create a file. We're going to be issuing a call like:

open('/tmp/foo', 'w')

Here /tmp/foo is the path to the file we're trying to open, and 'w' specifies that we're opening that file with write permission. Now if our Python application issues this call, we might have a fairly hard time testing the case where the open call fails, because on most machines there's going to be a directory called /tmp, and so this call will almost always succeed. So what we can do instead is call a different function,
my_open in this case, which has an identical interface to the open call. And in fact, not only is the interface identical, but the implementation is also almost identical: most of the time, my_open simply calls open. So what we have is a stub function that's almost functionally equivalent to open, but the key difference is that we wrote the code for my_open, and so, conditionally, inside the my_open code, we can sometimes cause the open call to fail. Again, this is called fault injection, and it's a fairly common testing technique.
In practice, you have to be pretty careful with fault injection. If you make my_open fail too often--for example, 50% of the time--then a program that uses it probably will never get off the ground. A large program that opens a lot of files will almost never succeed in opening all of them, and so we'll never actually get to executing the interesting parts of the program. So one thing we might do in my_open is have it initially succeed, maybe for the first 100 calls, and after that fail maybe 1% of the time.
That's just an example. In practice, we would have to experiment with what kind of failure rates are good at testing the software that's actually under test.
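Here's a minimal sketch of what such a fault-injecting wrapper might look like in Python. The name my_open follows the discussion above; the warm-up count of 100 calls and the 1% failure rate are the illustrative values just mentioned, and in practice we'd tune them against the software under test.

```python
import random

_call_count = 0          # how many times my_open has been called so far

def my_open(path, mode="r", **kwargs):
    # A stub that is almost functionally equivalent to open(): most of
    # the time it just calls open(). But once the program has had a
    # chance to start up (100 successful calls), it injects a failure
    # about 1% of the time. Both numbers are illustrative only.
    global _call_count
    _call_count += 1
    if _call_count > 100 and random.random() < 0.01:
        raise OSError("injected fault: simulated open() failure")
    return open(path, mode, **kwargs)
```

Pointing the software under test at my_open instead of open then exercises its error-handling paths without needing to, say, actually fill up a disk partition.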
So let's just take a quick quiz over this material.
Q. Faults injected into some software under test should be
a). all possible faults
c). faults that we want our code to be robust with respect to.
S. The answer is C. Generally, we don't need to inject all possible faults but rather just the ones that we expect might happen in practice and that we would like our system to be robust with respect to.
To recap a little bit: we have our software under test, and it provides some APIs. Each API is a collection of functions, and most of the work that we have during testing is going to be calling these functions with various values and looking at the results. Additionally, the software under test is going to use APIs provided by, for example, the operating system or the programming language runtime, and a separate kind of testing looks at how the software under test responds to return codes and the like given to it by the APIs that it uses.
It would be great if both of these kinds of testing represented a complete picture of everything that we need to test, but in fact that's not the case, and there are some added complications that we're going to talk about now.
The issue is that the overt explicit interfaces that we see here don't represent all the possible inputs and outputs that we might care about. For example, on the input side it's completely possible that our software under test cares about the time at which inputs arrive. So it might be the case that our software responds differently if 2 inputs arrive very close together than it does if 2 inputs arrive separated by a large amount of time. And if this seems silly, just consider, for example, the software that processes mouse clicks. Two clicks very close together represent a double click, and that's interpreted very differently from 2 mouse clicks that occur farther apart in time which count as 2 single mouse clicks.
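A tiny sketch of this timing-dependent distinction: the same two click events mean different things depending on how far apart they arrive. The half-second threshold here is purely an assumption for illustration.

```python
# Assumed double-click window, in seconds; real GUI toolkits make this
# a configurable setting.
DOUBLE_CLICK_WINDOW = 0.5

def classify_clicks(t1, t2):
    # Given the timestamps of two mouse clicks, decide whether they
    # form one double click or two separate single clicks.
    if abs(t2 - t1) <= DOUBLE_CLICK_WINDOW:
        return "double click"
    return "two single clicks"
```

Testing code like this means controlling not just which inputs arrive, but when they arrive.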
Another example is something like a web browser where if the data corresponding to a web page is returned in a short window of time, this data will get rendered as a web page. But if the data that comes from the network is scattered across too much time, this is going to result in some sort of a time-out-- that is to say, the software under test, which is our web browser, will render some sort of an error page instead of actually rendering the data if it comes too slowly.
Both of these examples that we just looked at are fairly easy to test because in each case, we have this sort of binary distinction between, in one case, the data arriving quickly-- that is to say, a double click or a complete web page arriving before the time-out-- and in the other case, we have data arriving too slowly-- that is to say, 2 single clicks or a web page that takes so long to arrive that we time out. In other cases, the timing-dependent input can make our lives significantly more complicated.
So now we're going to look at kind of an extreme example of timing-dependent input being difficult to deal with. In the 1980s, there was a machine called the Therac-25. The Therac-25 was a radiation therapy machine: it was used to deliver a highly concentrated beam of radiation to a part of the human body, where that beam could destroy cancerous tissue without harming the tissue nearby.
As you can see, this is not, obviously, an inherently safe technology. It's going to depend on skilled operators and also highly safe software in order to safely dose patients at the cancer site without harming the patients.
So what happened with the Therac-25 was a tragic series of mistakes in which 6 people were subjected to massive radiation overdoses, and several of them died. If you take a look at the report "An Investigation of the Therac-25 Accidents," you'll find that it's really quite terrifying reading.
The Therac-25 had a number of serious issues with its software, and we're just going to talk about 1 of them here.
The Therac-25 was a largely software-controlled device, and it had, for the time, a fairly sophisticated controller. It turned out that the people developing the software put a number of bugs into it. The particular bug I'm talking about here was a race condition: a scenario where different threads of execution fail to be properly synchronized, with the result that the software containing the race condition can actually make mistakes. This particular race condition in the Therac-25 software involved the keyboard input to the radiation therapy machine, which is what the person operating the machine used to tell it how to treat the patient. If the operator of the machine typed slowly, the bug was very unlikely to be triggered. And of course while the machine was being tested, the people testing it weren't very good at using it; they hadn't used it a lot, and so they didn't type very fast. But unfortunately, as operators in hospitals became more and more familiar with this machine, as they treated hundreds and hundreds of patients, they got very good at operating it, they typed faster and faster, and eventually they started triggering this bug. The effect of this bug, unfortunately, was to deliver massive radiation overdoses to patients, and, as I said, this led to several fatalities.
And so the kind of quandary that this scenario raises for us as software testers is do we have to care about the time at which inputs arrive at our software under test, or can we not worry about that? And so obviously, for the Therac-25 and obviously, also for something like a Linux kernel, the time at which inputs arrive is relevant. On the other hand, unless we've been extremely sloppy, the square root function that we've been talking about won't care about the time at which its inputs arrive.
Q. Now we're going to take a really quick quiz, and the question is would you consider it likely that the timing of inputs is important for testing:
a). software under test that interacts directly with hardware devices.
b). software under test that interfaces with other machines on the internet.
c). software under test that is multi-threaded.
d). software under test that prints time values into a log file.
Pick all of the answers that you think are good ones.
S. I think to some extent the answers here are debatable, but the best answer I would give is this: software under test that interacts directly with hardware devices definitely considers the timing of inputs important, and so does software that interfaces with other machines on a network. Software under test that's multi-threaded won't necessarily depend on the time at which inputs are delivered, but it certainly might. Finally, I would not consider software under test that prints time values into a log file to necessarily depend on the timing of its inputs. This kind of software--of which there are many examples--cares about time values in the sense that it prints them, but its logic probably doesn't depend on them, because they're just being written out to a device and not depended on in any meaningful way. Again, your mileage may vary on this, but my opinion is that a, b, and c are the best answers.
What you'll notice here is that I gave you a quiz before telling you how to actually determine whether the timing of inputs is important for a particular software under test, and that was deliberate because I wanted to get you to think about the issues, but also because there aren't really any great hard and fast answers here, so let's talk about this a little bit more.
Basically, to figure out if timing matters for the inputs to the software under test, think about its specification--what it's trying to do and what its requirements are--and also look at its source code. You want to look for things like timeouts, timed sleeps, and, basically, values or computations that depend on the time at which things happen. If you see any of that, then it's strongly possible that you're going to need to take time into account while testing the software, and that means, for example, delivering some inputs to the software under test very quickly, delivering other inputs slowly, and seeing how the software responds.
Okay, so what we're doing here is continuing to dig into some of the nuances of testing software, and we're going to keep looking at things that can make testing hard. The final issue that I want to talk about in this general vein of things that can complicate testing is what I like to call non-functional inputs, and these are inputs that affect the operation of a software under test that have nothing to do with the APIs provided by the software that we're testing and nothing to do with the APIs that are used by the software that we're testing.
So what are these non-functional inputs? Well, they include things like context switches--that is, switches between different threads of execution in multi-threaded software under test.
Now, of course, you shouldn't have to worry about context switches at all if your software under test only has a single thread, and the assignments we're going to be working on in this course are single-threaded, but in general it's very common for complex software systems to contain multiple threads of execution.
The issue is that these multiple threads of execution are scheduled onto different processors on the physical machine at different times, and it's the operating system that decides what thread goes on what processor at what time. Depending on the scheduling of these threads, bugs in the software under test can either be concealed or revealed, and the problem is that the timing of these context switches is completely outside the control of our application. It's under the control of the operating system, which provides these non-functional inputs, and this makes testing multi-threaded software really quite difficult.
Let me give you another example of non-functional inputs. Some years ago, in the late 1990s, I spent a summer working at a company that made very, very fast networking hardware--hardware that, for example, would let 2 PCs talk at multi-gigabit speeds using a switch. The interesting thing about the software we were developing, which ran not only on the PC but also on a network card, was that it was completely independent of the TCP/IP stack that normally provides reliable delivery for machines connected by networks, and it was supposed to provide reliable delivery of data even when there were errors in the network.
The problem we faced was that this network was in fact extremely reliable. It would introduce bit errors into the data being transmitted maybe only once every several hours, or even only once every several days, and so we had a real difficulty testing the end-to-end reliability software running on the PCs, precisely because the network was so reliable.
So what we often ended up doing was we would open up a switch, exposing all of the electrical contacts inside, and then we would take a key, a metal key, and run it across the contacts that were exposed from some of the chips on the inside of the switch. What this would do is introduce a massive number of very short-lived short circuits inside the switch, causing a huge number of bit errors causing the software running on the PCs to have to cope with all of these bit errors. Of course, either the network would glitch for a moment and then when it started up the data transfer would resume, or else it would fail to resume properly, in which case possibly the software crashed or possibly it delivered erroneous data, in which case we had some debugging to do.
What I like about this example is that, using the very crude mechanism of running a key across contacts, we were able to reach a deep level of the system and introduce errors that we couldn't conveniently introduce at any other level of the software stack. This, again, was a kind of non-functional input to the system under test, and by getting control over it we were able to perform better testing of the software.
What I want to do now is look at some of the different kinds of testing that exist, and so what this really is is a very, very brief survey. It's not intended to be exhaustive, and it's not intended to give you an orthogonal set of kinds of testing because there's going to be a bunch of overlap between these different categories.
One kind of testing you often hear about is white box testing, and what white box testing refers to is the fact that the tester is using detailed knowledge about the internals of the system in order to construct better test cases. In the next lesson we're going to really look at white box testing in quite a bit of detail.
The complement to white box testing is black box testing, and black box testing refers to the fact that we're not using detailed knowledge of the system's internals. We're rather testing the system based on what we know about how it's supposed to respond to our test cases.
Unit testing is something we're going to spend a lot of time on in this course. Unit testing means taking some small software module at a time and testing it in an isolated fashion; this is what we were doing with the bounded buffer example from the last unit and with the square root function that we were talking about earlier in this unit. The main thing that distinguishes unit testing from other kinds of testing is that we're testing a smaller amount of software. Often the person performing the unit testing is the same person who implemented the module, and in that case we may well be doing white box testing. But unit testing can also be black box testing.
The goal of unit testing is to find defects in the internal logic of the software under test as early as possible in order to create more robust software modules that we can compose later with other modules and end up with a system that actually works.
Besides the size of the software under test the other thing that distinguishes unit testing from other kinds of testing is that generally at this level we have no hypothesis about the patterns of usage of the software under test. In other words, we're going to try to test the unit with inputs from all different parts of its domain. Remember, the domain is the set of possible inputs for this piece of software.
Unit testing is also a kind of testing that enjoys a great deal of good tool support, and so Python, in fact, has a number of frameworks for unit testing. It also has a number of frameworks for what are called mock objects, and what mock objects are are fake objects that we can bolt on to the software under test that mock up the behavior of the larger software system in which this unit lives.
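As a small sketch of what a mock object looks like in practice, here's an example using Python's standard unittest.mock module. The unit under test, greet, and the get_name method it relies on are made-up names for illustration.

```python
from unittest import mock

# Hypothetical unit under test: it formats a greeting using a name
# fetched from some larger system (a database, a directory service...).
def greet(source):
    return "Hello, %s!" % source.get_name()

# The mock object stands in for the larger system the unit normally
# lives inside, letting us test greet() in complete isolation.
fake_source = mock.Mock()
fake_source.get_name.return_value = "CS258"

assert greet(fake_source) == "Hello, CS258!"
fake_source.get_name.assert_called_once_with()
```

The mock both supplies the behavior the unit expects and records how the unit used it, which is exactly the role the surrounding system would otherwise have to play.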
The next kind of software testing that I want to talk about is integration testing. Integration testing refers to taking multiple software modules that have already been unit tested and testing them in combination with each other.
At this point, what we're often really testing are the interfaces between modules, and the question is: did we define and specify them tightly enough that the different people, or groups of people, implementing the different modules were able to make compatible assumptions, which is necessary for the software modules to actually work together? It often turns out that during integration testing we find substantial differences of assumption that have to be resolved before we can create a reliable integration of the individual parts.
As a computer science professor, what I find is that groups of students are almost always successful at creating well-tested, reliable software modules, but that separate groups of students who individually create software modules, even when implementing to a fairly tight specification, often create modules that fail under integration testing. Coming up with a software module that survives integration testing is often quite a lot harder than creating reliable small units.
The next kind of testing is system testing, and here we're asking a different question than we were asking with integration testing and unit testing. Here we're asking the question does the system as a whole meet its goals? Often at this point we're doing black box testing, and that's for a couple of reasons.
And so another thing that differentiates system testing from unit testing is the fact that at this level we are often concerned with how the system will be used. And the fact is, at this level we may not care about making the system work for all possible use cases. Rather, we would simply like to make sure that it performs acceptably for the important use cases.
Somewhat orthogonal to unit testing, integration testing, and system testing is a testing technique called differential testing.
In differential testing what we're doing is taking the same test input, delivering it to 2 different implementations of the software under test, and comparing them for equality. Differential testing is extremely powerful in the case where we have multiple versions.
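Here's a minimal sketch of differential testing in Python. The reference implementation is math.sqrt; the second implementation, newton_sqrt, is a simple Newton's-method square root written just for this illustration, and the trial count and tolerance are assumptions.

```python
import math
import random

def newton_sqrt(x, iterations=60):
    # A second, independent square root implementation (Newton's method),
    # written only so we have something to compare math.sqrt against.
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        guess = (guess + x / guess) / 2.0
    return guess

def differential_test(trials=1000, seed=42):
    # Deliver the same pseudo-random input to both implementations and
    # compare the outputs. Floating-point results are compared
    # approximately, not exactly.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(0.0, 1e6)
        reference, candidate = math.sqrt(x), newton_sqrt(x)
        assert math.isclose(reference, candidate, rel_tol=1e-9), \
            (x, reference, candidate)
    return trials
```

Any input on which the two implementations disagree is automatically an interesting test case, with no hand-written oracle needed.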
Stress testing is a kind of testing where a system is tested at or beyond its normal usage limits, and it's probably best described through a couple of examples.
Stress testing is typically done to assess the robustness and reliability of the software under test.
And finally, we have what might be my favorite kind of testing: random testing. In random testing we use the results of a pseudo-random number generator to randomly create test inputs, and we deliver those to the system under test. Sometimes we're just feeding raw random bits to the software under test, but very often this can be much more subtle and sophisticated than that. Random testing is very often useful for finding corner cases in software systems. The crashme program for Linux and other UNIX kernels that we talked about earlier is a great example of a random tester, and we'll see more examples later on in this course.
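As a small sketch of random testing paired with an acceptability check, here the software under test is Python's built-in sorted(), and the checker verifies the two properties any sort must have: the output is ordered, and it's a permutation of the input. The function names and trial counts are made up for illustration.

```python
import random
from collections import Counter

def is_acceptable(inp, out):
    # The acceptability check: output must be in nondecreasing order...
    in_order = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    # ...and must contain exactly the same elements as the input.
    same_elements = Counter(inp) == Counter(out)
    return in_order and same_elements

def random_test(trials=1000, seed=0):
    # Pseudo-randomly create test inputs and run each one through the
    # software under test and then through the acceptability check.
    rng = random.Random(seed)
    for _ in range(trials):
        inp = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = sorted(inp)
        if not is_acceptable(inp, out):
            return inp          # a failing input we'd need to debug
    return None                 # every test case succeeded
```

This mirrors the picture from the start of the lesson: a source of test inputs, the software under test, and an acceptability check deciding whether each test case succeeded or failed.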
Now I'd like you to take a short quiz. I'm going to describe a kind of testing that somebody is doing, and you're going to tell me which of the kinds of testing that we've discussed that it's most like.
Q. We have a piece of software under test, and this is some sort of a really critical embedded system, maybe, for example, it's controlling airbags on a vehicle. And this vehicle is going to be out in the field for many years. Cars get subjected to fairly extreme conditions. They have a lot of heating and cooling cycles. They get exposed to salt and mud and water, and so the cars are in a fairly harsh environment. One of the things we're going to want to know is how the software responds to defects that get introduced into the automobile's wiring as this vehicle ages in the field, that is to say as people are driving it around the salty roads in Michigan or whatever.
What we're going to do, using maybe a simulator or something, is make it look as if a wire that software running on a processor uses to talk with other modules has gone bad (cars are really a big distributed system). The software under test here is just one of many pieces of software running on the car.
We're going to use some sort of a simulation module to simulate a noisy wire, and that basically means, for example, a wire whose insulation has rubbed off and that is not perfectly conducting signals anymore. It's getting junk injected into it because it's banging against the car's frame or something. The question is what kind of testing are we doing? And your choices are:
a). regression testing,
b). unit testing,
c). stress testing,
d). white box testing.
Choose whichever you think best describes what we're doing.
S. As far as I'm concerned, the best answer here is stress testing (c).
We're not doing regression testing because regression testing always involves taking inputs that previously made the system fail and replaying them against the system, and the way I described it that wasn't the case here. We're probably not doing unit testing because the software under test running on this processor, even if it's just one part of all of the software running on a car, still has many modules. It has an operating system. It has some communication software. It has the actual application functionality, whatever it's doing, controlling airbags, and so we're most likely not doing unit testing, although if you answered unit testing that's not a bad answer. But my personal judgement is that it's not the best one, and we're certainly not doing white box testing because we haven't considered at all the structure of the software under test here. This is really a kind of black box testing.
Q. All right, in this next example software under test is some sort of a web service, and what I'm going to do is I'm just going to go ahead and roll this out onto one of the machines in our cluster, but I'm not going to expose this to most of our users. What I'm going to do is expose this machine to some small fraction of our users who have been selected, for example, based on their willingness and desire to use new features.
The question is what kind of testing is this most like? Is this:
a). Integration testing?
b). Differential testing?
c). Random testing?
d). System/validation testing?
Okay, so please take a second to go ahead and mark what you believe is the kind of testing that's the best match for what it is that we're doing here.
S. I think the best answer is that we're doing system or validation testing (d).
That is to say we're testing the whole system as a unit, and we're trying it on some small subset of our users to try to validate the fact that the changes we've made are good, that they work and they're not going to cause any sort of major problems when we roll it out for the entire user base.
We're not doing random testing because there's no random element here. We're not doing differential testing because we don't have multiple implementations of anything. We have a single system that we're looking at. We're not doing integration testing because the character of the testing that we're doing here does not have the flavor of putting multiple modules together that haven't been tested together previously. Rather, we're rolling something out and seeing what happens.
Q. In this scenario we have some large software under test, and part of it is a library that's been flaky and has been giving us problems. Let's say this library is implementing numerical functions, and these numerical functions have been sometimes throwing floating point exceptions and crashing our system.
What happens is the vendor has given us a new version, but what we don't want to do is just roll that directly into our production system because in the past we've gotten burned by that sort of thing where we have deployed a new version and it contains flaws that were worse than the flaws in the previous version. What we're going to do is we're going to spend some time checking this thing out and trying to make sure that even if it's not that great it's at least not worse than the previous version.
The question I have for you is what kind of testing is this most like?
a). Is it unit testing?
b). Is it white box testing?
c). Is it black box testing?
Or, on the other hand,
d). Is it stress testing?
Okay, please take a minute to mark what you think is the best answer.
S. The answer is that it could have been any of these, but I think the best answer is unit testing (a): we're taking a software module and testing it in isolation to see whether it's up to our standards before we integrate it with the rest of our software. Now, the testing we're actually doing could have been white box testing, black box testing, or stress testing, but I didn't give you enough details about what we were doing to allow you, I believe, to conclude that any of those was definitely what we were doing. So what I think is the best answer here is unit testing.
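As a rough sketch of what such unit testing might look like (the function below is a hypothetical stand-in for a vendor routine, not anything named in the scenario), we can exercise the numerical code in isolation, including on the kinds of boundary inputs that have triggered floating-point trouble, before integrating it:

```python
import math
import unittest

def safe_sqrt(x):
    """Hypothetical stand-in for a vendor-supplied numerical routine."""
    if x < 0:
        raise ValueError("domain error")
    return math.sqrt(x)

class TestNumericalLibrary(unittest.TestCase):
    """Unit tests: exercise the module in isolation, before integration."""

    def test_ordinary_inputs(self):
        self.assertAlmostEqual(safe_sqrt(4.0), 2.0)
        self.assertAlmostEqual(safe_sqrt(2.0), math.sqrt(2.0))

    def test_boundary_inputs(self):
        # Edge cases of the sort that have caused crashes in the past.
        self.assertEqual(safe_sqrt(0.0), 0.0)
        self.assertRaises(ValueError, safe_sqrt, -1.0)

# Run just this module's tests, without touching the rest of the system.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNumericalLibrary)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The point isn't the specific assertions; it's that the module is checked against our standards on its own, so a regression in the new version shows up before it reaches production.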
Now I'd like to talk about being great at testing software, which involves a number of different elements. First of all, we need to recognize that testing and development are fundamentally different roles, even if they're often played by the very same people. A developer's primary attitude is:
"I want this code to succeed."
A tester's main attitude is:
"I want this code to fail."
Of course, the reason the tester wants the code to fail is that the tester's end goal is creating stronger code which later on doesn't fail.
If we look at these requirements a little, we can see that for the same person to be both a great tester and a great developer, a little doublethink might be involved. Those of you who have read Orwell will know that doublethink is the ability to hold two contradictory beliefs in one's mind simultaneously, and a bit of that is required to be both a great tester and a great developer. The contradictory nature of these two beliefs is only apparent, though, because in the end the developers and the testers want the same thing: to create great software.
The second element of being great at testing is learning to test creatively. I showed an example earlier of testing an ARM assembler and disassembler by exploiting the fact that they are inverses of each other, and also the fact that the ARM instruction space can be fully enumerated; that's something I consider a great example of creative testing. If we can think of creative ways to test, we often do a much better job than with rote testing, and certainly a much better job than rote testing with only the most obvious inputs.
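The round-trip idea can be sketched in miniature. The toy encoder and decoder below are hypothetical (nothing like the real ARM tools), but they show the same two properties being exploited: the functions are inverses, and the instruction space is small enough to enumerate exhaustively.

```python
# Toy instruction set: a 4-bit opcode and a 4-bit register packed into a byte.
OPCODES = {"ADD": 0, "SUB": 1, "MOV": 2, "CMP": 3}
MNEMONICS = {v: k for k, v in OPCODES.items()}

def assemble(mnemonic, reg):
    """Pack a mnemonic and a 4-bit register number into one byte."""
    return (OPCODES[mnemonic] << 4) | (reg & 0xF)

def disassemble(word):
    """Unpack a byte back into (mnemonic, register)."""
    return MNEMONICS[word >> 4], word & 0xF

# Exhaustively enumerate the whole instruction space and check that
# disassembling an assembled instruction returns what we started with.
failures = []
for mnemonic in OPCODES:
    for reg in range(16):
        if disassemble(assemble(mnemonic, reg)) != (mnemonic, reg):
            failures.append((mnemonic, reg))

print("round-trip failures:", len(failures))  # prints: round-trip failures: 0
```

The real ARM instruction space is vastly larger, of course, but the structure of the test is the same: no hand-written expected outputs, just an inverse property checked over every input.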
The third thing is that great testers never ignore weird things that they notice. In my experience, it's very often the case that we get little hints of something wrong with the software: threads that, if tugged on, would lead us to discover problems that are really quite serious. On the other hand, if we see these little things go wrong and paper over them, if we ignore them, we end up not finding those problems until later.
I'm going to end today with the claim that great testing is a lot of fun. It's fun to break software, and it's very satisfying to produce software that's really great because it's been well tested. Great testing can also be profitable: testing is a separate career at many companies like Microsoft, and companies like Google and Mozilla offer bug bounties. What this means is that if you find a security-critical bug in, for example, Google Chrome, they'll pay you up to $20,000, and this isn't at all theoretical. I was recently talking to a software testing researcher who wrote an automatic tester and applied it to Chrome and to Mozilla, and over the course of a few months their research group made about $50,000.