
CS258 Lesson 2

1. Introduction

You can usually think of problems in released software as coming from things that people forgot to test: the testing was inadequate, and the developers were unaware of that fact.

What we're going to look at this week is a collection of techniques called code coverage, where automated tools can tell us places where our testing strategy is not doing a good job. That is very powerful and useful.

2. How much testing is enough?

One of the trickiest things about testing software is that it's hard to know when you've done enough testing. It is really easy to spend a lot of time testing, start to believe that you did a good job, and then have some really nasty bugs show up that are triggered by parts of the input space that we just didn't think to test.

So let's look at what's really going on:

  • We're going to start with the domain/range diagrams that we already know about.
  • We're going to throw some test cases into the domain, and after a while,
  • we're going to start feeling pretty good about what we've been doing.

258-2-2a.png

What's going to turn out is that, without really knowing it, our testing is confined to some small part of the input domain. The problem is that even that small part of the domain may contain an infinite number of test cases, or else a finite number of test cases that is large enough that, for practical purposes, it's not distinguishable from infinity. Of course, what's going on is that test cases in other parts of the domain, the parts we didn't think to test, produce results and outputs that are not okay.

258-2-2b.png

It really depends on how we've broken up the input domain.

For example, let's think about the case that we're testing the Therac-25 radiation therapy machine that I used as an example in the last lecture.

It might be the case that all of the inputs that we happened to test, the ones in this region, are the ones where we happened to type the input to the machine slowly. We simply didn't realize that there's a whole other part of the input space that has to be treated differently, the part we reach when we type the input fast. And of course it's in that region where we trigger the race conditions that were leading to massive overdoses.

Similarly, if you remember from lecture one, we had a small Python program that caused the Python runtime to crash. Its distinguishing feature seemed to be a large number of nested if statements.

It's pretty easy, if we're testing Python programs, to remain in the part of this space where, for example, programs have fewer than 5 nested if statements.

Over here is another region containing programs with 5 or more nested if statements, and these are the ones that cause, for example, the Python virtual machine to crash.

To take an even more extreme example, let's say that we're testing some software into which somebody has inserted a back door. In that case, there's going to be an absolutely infinitesimal part of the input domain, maybe way over here, that triggers the back door, because remember, if you're putting a back door in code, you don't want it triggered accidentally, and triggering it leads to something extremely bad happening. We didn't test inputs triggering the back door because we just didn't know it was there.

So what we'd really like is some sort of tool or methodology that, if we are in fact testing only a small part of the input domain for a system, gives us an automated score for our testing effort and says to us something like: your score is 14 out of 100; you're not doing a good job testing the system; keep trying.

And that's what today's lecture is going to be about.

It turns out there are a lot of reasons we'd want to assign a score to our testing effort.

  • Probably the main one is that this helps us find part of the input domain that needs more testing.

So for example, if we can increase our score by testing this part of the domain, we're naturally going to be led to do that and our testing efforts will improve.

  • Other reasons to assign a score to a testing effort are that we might be able to argue to a boss, or to some sort of certifying authority, that we've done enough testing.
  • Similarly, we might want to argue that the system we're developing has not been tested well enough, is not yet safe to deploy, and needs more testing; a testing score can provide numerical justification for that sort of argument.

Finally, it would be nice if we could take a large test suite, one that maybe takes several days to run, and identify parts of the test suite that are completely redundant. That is to say, parts of the test suite that test exactly the same parts of the input domain, or that occupy parts of the input domain that have roughly the same testing effect on the system. Assigning a score to our testing efforts can let us do that as well.

3. Partitioning the input domain

This time, let's talk about some theory called partitioning the input domain.

So, we are going to start with some software under test, and it's going to have a set of possible inputs, an input domain. Of course, this input domain usually contains so many possible test cases that there's no way we can possibly test them all.

Speaking historically, what people would've often been interested in is ways to partition the input domain for a piece of software under test into a number of different classes so that all of the points within each class are treated the same by the system under test. And while constructing these classes, we are allowed to look at the implementation of the software, we are allowed to look at the specification, we are even allowed to use our vague suspicions that we have. We can use anything we want in order to create these partitions.

So, for example, we will have some subset of the input domain. For purposes of finding defects in the system under test, any point within that subdomain is as good as any other. So, basically, when testing the system, we pick an arbitrary point, execute the system on it, look at the output, and if it is acceptable, then we're done testing that class of inputs.

So, obviously, in practice, sometimes this partition is going to fail, and by fail, I mean that the thing that we thought was a class of inputs that are all equivalent with respect to the system under test isn't really, and in fact, there is a different class hiding within this class which triggers a bug even though the original test case didn't. And so, when that happens, we can sort of blame the partitioning scheme. We can say that we improperly partitioned the input domain.

The problem with this sort of scheme is that we can always blame the partitioning, and the unfortunate fact is that the original definition of this partitioning scheme didn't really give us very good guidance on how to actually do the partitioning. All it said was to do it.

In fact, this sort of scheme hasn't worked out for large systems under test--we're talking about complex software like real-time embedded systems, operating systems, and other such things.

4. Coverage

So in practice, what we ended up with is not this idea of coming up with a good partitioning of the input domain, but rather a notion of test coverage. Test coverage tries to accomplish exactly the same thing that partitioning was accomplishing, but it goes about it in a different way.

Test coverage is an automatic way of partitioning the input domain based on some observed features of the source code, so let me say what I mean. One particular kind of test coverage that we might aim for, a sort of easy kind, is called function coverage.

Function coverage

Function coverage is achieved if we managed to test our system in such a way that every function in our source code is executed during testing.

We will be dividing our input domain into chunks, where any test case in this part of the input space is going to result, for example, in a call to foo. Then there's going to be some different subset of our input domain, and any point in that subset, when used as a test input, is going to result in a different function in the system under test being called--let's say bar.

We can keep subdividing the input domain for the software under test until we have split it into parts that result in every function being called.

So now, of course, doing this in theory is easy. In practice, we start with a set of test cases, and we run them all through the software under test. We see which functions are called, and then we're going to end up with some sort of a score.

So, for example, some sort of tool that is watching our software execute can say: we called 181/250 functions. This kind of score is called a test coverage metric. It means that our test cases so far have covered 181 of the 250 functions that we implemented. Now that we have achieved our goal of assigning a score to a collection of test cases, the next thing we have to ask is: is this score any good? Is it good test coverage to have executed 181/250 functions?

With this example, it's probably not.

So, what we can do is go and look at each of the functions that wasn't covered, and try to come up with a test input that causes that function to execute.

If there is some function baz here for which we can't seem to devise an input that causes it to execute, then there are a couple of possibilities.

  • One possibility is that it can't be called at all. It's dead code.
  • Another possibility is that we simply don't understand our system well enough to be able to trigger it.

Either way, there is something possibly suspicious or wrong.
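To make the mechanism concrete, here is a minimal sketch (not from the lecture) of how a tool could record which functions run during testing, using Python's sys.settrace; the function names foo, bar, and baz are made up stand-ins for a real system under test.

import sys

called = set()

def tracer(frame, event, arg):
    # Record the name of every function that gets entered.
    if event == "call":
        called.add(frame.f_code.co_name)
    return None

# Hypothetical functions standing in for the system under test.
def foo(): return 1
def bar(): return 2
def baz(): return 3

all_functions = set(["foo", "bar", "baz"])

sys.settrace(tracer)
foo()        # our "test suite" only ever calls foo and bar
bar()
sys.settrace(None)

covered = len(called & all_functions)
print "function coverage: %d/%d" % (covered, len(all_functions))

In practice a tool like coverage.py does this kind of instrumentation for us; the sketch just shows the flavor of what "watching the software execute" means.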

5. Test coverage

Test coverage lets us assign a score to a collection of test cases, so let's be a little bit more rigorous about it.

So test coverage is really a measure of the proportion of a program exercised during testing. For example, we've just talked about measuring the number of functions, out of the total number of functions, exercised by the tests that we ran. What's good about test coverage is that it gives us a score, something objective that we can use to try to figure out how well we're doing. Additionally, when coverage is less than 100%--that is to say, as in our example, where we had failed to execute all of the functions in the software under test--we know what we need to do to get full coverage. We know which functions we still need to execute; now we simply need to construct test cases that execute those functions.
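For example, with the numbers from before, the score would be 181/250 = 72.4% function coverage.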

So, these are the good things about test coverage.

On the other hand, there are also some disadvantages.

  • First of all, test coverage, because it is a white box metric derived from the source code of our system, is not good at helping us find bugs of omission, that is to say, bugs where we simply left out something that we should have implemented.
  • The second drawback is that it can be really hard to know what a test coverage score of less than 100% means. In safety-critical software development, what's sometimes done is to require 100% coverage under a certain coverage metric, and that removes this problem: we don't have to interpret scores less than 100% because we're not allowed to ship our product until we reach 100% coverage.

For larger, more complex software systems, where the standards of correctness are not as high as they are for safety-critical systems, it's often the case that it's difficult or impossible to achieve 100% test coverage, leaving us with the problem of trying to figure out what a lower score actually means about the software.

  • The third disadvantage is that even 100% coverage doesn't mean that all bugs are found, and you can see that fairly easily by thinking about the example where we're measuring our coverage by counting the number of functions we executed.

Just because we executed some function, of course, doesn't mean that we found all the bugs in that function. We may not have executed very much of it, or may not have exercised very many of the interesting behaviors inside that function.

6. Splay tree

Well, that's enough abstract ideas about coverage for now. Let's take a concrete look at what coverage can do for us in practice, and how it can help us do better testing.

What we're going to do here is look at some random open source Python code that implements a splay tree. Before we go into the details, let me briefly explain what a splay tree is.

Broadly speaking, a splay tree is just a kind of binary search tree. A binary search tree is a tree where every node has at most two children, and it supports operations such as insert, delete, and lookup. The important thing about a binary search tree is that the keys have to support an ordering relation--anything that assigns a total ordering to all the keys in the key space. As a simple example, if we're using integers for keys, then we can use less-than as the ordering relation. Or, if we're using words as our keys, then we can use dictionary order.

And so again, it doesn't matter what kind of data type the key is in a binary search tree. All that really matters is that the keys we're going to use to look up elements in the tree have this ordering relation.

The way the binary search tree is going to work is, we're going to build up a tree under the invariant that the left child of any node always has a key that's ordered before the key of the parent node, and the right child is always ordered after the parent node. And so, hopefully, what we can see now is that if we build up some sort of a large tree with this kind of shape, we have a procedure for fast lookup.

258-2-6a.png

The way that lookup works is, when we're searching for a particular key in the binary search tree, we only have to walk one path from the root down to the leaves. We always know which subtree might contain the key that we're looking for, and of course, we have to actually go down into that subtree to see if it's there, but the point is we only have to walk one path to the tree.
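As a minimal sketch (not part of the splay tree code we're about to read), a plain binary search tree lookup that walks a single root-to-leaf path might look like this, assuming nodes with key, left, and right attributes like the Node class we'll see later:

def bst_find(node, key):
    # Walk one path from the root downward, using the ordering
    # invariant to decide which subtree could contain the key.
    while node is not None:
        if key < node.key:
            node = node.left      # key orders before this node
        elif key > node.key:
            node = node.right     # key orders after this node
        else:
            return node           # found it
    return None                   # key is not in the tree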

The upside is that, generally, operations on this kind of search tree require a number of tree operations logarithmic in the number of tree nodes.

So for example, if we have a tree with a million nodes, and the tree that we build ends up being relatively balanced, then usually we're going to expect to do about \log_2 1,000,000 tree operations as part of a lookup, insert, or delete, so we're going to end up doing roughly 20 operations.

Since this number of operations is far lower than the number of nodes in the tree, this is generally considered to be an efficient kind of data structure.

7. Splay tree issues

There are a few extra things you need to know about splay trees. The first is that a splay tree is a self-balancing tree. Remember that with a plain binary search tree, if you insert nodes in sorted order, you end up with a degenerate tree that looks something like this.

258-2-7a.png

And as you can see, this is a very bad kind of binary search tree, because to do a lookup here we're going to have to walk all of the nodes in the tree, and we've lost the nice logarithmic property that made lookups, inserts, and deletes extremely fast.

The way a self-balancing binary search tree works is, as you add elements to the tree, it has some sort of a balancing procedure that keeps the thing approximately balanced so that tree operations remain very fast.

If you look at the literature, it turns out there are tons and tons of different data structures that offer self-balancing binary search trees, and the fast ones end up being somewhat complicated. The splay tree is really one of the simplest examples of a self-balancing binary search tree, and the implementation that we're going to look at in a minute contains something like 100 lines of code.

The other thing you need to know about splay trees before you get into the code is that they have a really cool property: when we access a node--let's say we do a lookup of this node here, which contains 7--what's going to happen is that, as a side effect of the lookup, that node is going to get migrated up to the root, whatever was previously at the root is going to be pushed down, and possibly some sort of balancing operation is going to happen.

The point is that frequently accessed elements end up being pushed towards the root of the tree, and therefore future accesses to these elements become even faster. So this is a really nifty feature of the splay tree.

So what we're going to do now is look at an open source Python splay tree--specifically, a random one that I found on the web that happens to implement a splay tree. It comes with its own test suite, and we're going to look at what kind of code coverage that test suite gets on the splay tree.

An evaluation of self-adjusting binary search tree techniques

8. Splay tree example

So here is the splay tree code in an editor called Emacs, which is my personal editor of choice. It's doing syntax highlighting and indenting, very much like the Udacity online IDE would do for you, but here I'm running it on my local PC.

# Licensed under the MIT license: 
# http://www.opensource.org/licenses/mit-license.php

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None

    def equals(self, node):
        return self.key == node.key

class SplayTree:
    def __init__(self):
        self.root = None
        self.header = Node(None) #For splay()

    def insert(self, key):
        if (self.root == None):
            self.root = Node(key)
            return

        self.splay(key)
        if self.root.key == key:
            # If the key is already there in the tree, don't do anything.
            return

        n = Node(key)
        if key < self.root.key:
            n.left = self.root.left
            n.right = self.root
            self.root.left = None
        else:
            n.right = self.root.right
            n.left = self.root
            self.root.right = None
        self.root = n

    def remove(self, key):
        self.splay(key)
        if key != self.root.key:
            raise 'key not found in tree'

        # Now delete the root.
        if self.root.left== None:
            self.root = self.root.right
        else:
            x = self.root.right
            self.root = self.root.left
            self.splay(key)
            self.root.right = x

    def findMin(self):
        if self.root == None:
            return None
        x = self.root
        while x.left != None:
            x = x.left
        self.splay(x.key)
        return x.key

    def findMax(self):
        if self.root == None:
            return None
        x = self.root
        while (x.right != None):
            x = x.right
        self.splay(x.key)
        return x.key

    def find(self, key):
        if self.root == None:
            return None
        self.splay(key)
        if self.root.key != key:
            return None
        return self.root.key

    def isEmpty(self):
        return self.root == None

    def splay(self, key):
        l = r = self.header
        t = self.root
        self.header.left = self.header.right = None
        while True:
            if key < t.key:
                if t.left == None:
                    break
                if key < t.left.key:
                    y = t.left
                    t.left = y.right
                    y.right = t
                    t = y
                    if t.left == None:
                        break
                r.left = t
                r = t
                t = t.left
            elif key > t.key:
                if t.right == None:
                    break
                if key > t.right.key:
                    y = t.right
                    t.right = y.left
                    y.left = t
                    t = y
                    if t.right == None:
                        break
                l.right = t
                l = t
                t = t.right
            else:
                break
        l.right = t.left
        r.left = t.right
        t.left = self.header.right
        t.right = self.header.left
        self.root = t

And so what we can see is that we have a SplayTree class; it supports an insert method.

    def insert(self, key):
        if (self.root == None):
            self.root = Node(key)
            return

        self.splay(key)
        if self.root.key == key:
            # If the key is already there in the tree, don't do anything.
            return

        n = Node(key)
        if key < self.root.key:
            n.left = self.root.left
            n.right = self.root
            self.root.left = None
        else:
            n.right = self.root.right
            n.left = self.root
            self.root.right = None
        self.root = n

The remove method takes a key out of the tree:

    def remove(self, key):
        self.splay(key)
        if key != self.root.key:
            raise 'key not found in tree'

        # Now delete the root.
        if self.root.left== None:
            self.root = self.root.right
        else:
            x = self.root.right
            self.root = self.root.left
            self.splay(key)
            self.root.right = x

and a couple of other operations.

So insert, remove, and lookup are the basic operations supported by any binary search tree, but many implementations support additional operations. So find is the lookup operation for this tree.

    def find(self, key):
        if self.root == None:
            return None
        self.splay(key)
        if self.root.key != key:
            return None
        return self.root.key

And then there's the splay operation. What splay does is move a particular key up to the root of the splay tree. This serves as both the balancing operation for the splay tree and the core of lookup. The splay routine involves a couple of screenfuls of code, but it's still fairly simple.

And now, let's look quickly at the test suite first.

import unittest
from splay import SplayTree

# Licensed under the MIT license: 
# http://www.opensource.org/licenses/mit-license.php

class TestCase(unittest.TestCase):
    def setUp(self):
        self.keys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        self.t = SplayTree()
        for key in self.keys:
            self.t.insert(key)

    def testInsert(self):
        for key in self.keys:
            self.assertEquals(key, self.t.find(key))

    def testRemove(self):
        for key in self.keys:
            self.t.remove(key)
            self.assertEquals(self.t.find(key), None)

    def testLargeInserts(self):
        t = SplayTree()
        nums = 40000
        gap = 307
        i = gap
        while i != 0:
            t.insert(i)
            i = (i + gap) % nums

    def testIsEmpty(self):
        self.assertFalse(self.t.isEmpty())
        t = SplayTree()
        self.assertTrue(t.isEmpty())

    def testMinMax(self):
        self.assertEquals(self.t.findMin(), 0)
        self.assertEquals(self.t.findMax(), 9)

if __name__ == "__main__":
    unittest.main()

So this is the test suite that the author of this open source splay tree wrote. In the test suite, you can see an import of the Python module unittest, which basically provides a little bit of infrastructure for running unit tests. What it's going to do is automatically call these different test functions, which are defined on an object that inherits from unittest.TestCase.

    def setUp(self):
        self.keys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        self.t = SplayTree()
        for key in self.keys:
            self.t.insert(key)

And you can see here that, for example, we have a setUp routine that initializes a list of keys. We make ourselves a new splay tree and insert those 10 elements into the tree. We have a testInsert function which loops over all of the keys and asserts that they're found in the tree.

    def testInsert(self):
        for key in self.keys:
            self.assertEquals(key, self.t.find(key))

We have a remove test function which is going to be called last in our test suite.

    def testRemove(self):
        for key in self.keys:
            self.t.remove(key)
            self.assertEquals(self.t.find(key), None)

The functions beginning with a string 'test' are called in alphabetical order, and this happens to be the last one. This is going to remove all of the keys from the tree.

Then finally, we have a function which is going to insert a large number of elements in the tree.

    def testLargeInserts(self):
        t = SplayTree()
        nums = 40000
        gap = 307
        i = gap
        while i != 0:
            t.insert(i)
            i = (i + gap) % nums

It's going to insert 40,000 elements into the tree. It's going to make sure that they get staggered by a gap of 307. This is basically going to end up stress testing our splay tree a little bit.

Finally, we have a routine that tests whether the tree correctly knows that it's empty.

    def testIsEmpty(self):
        self.assertFalse(self.t.isEmpty())
        t = SplayTree()
        self.assertTrue(t.isEmpty())

Okay, so this is our sort of minimal unit test.

What we can do is run this, and it's going to take just a couple of seconds. Okay, so all of those tests took about 1.6 seconds to run. What I'm going to do now is run the same tests under a code coverage tool.

coverage erase ; coverage run splay_test.py ; coverage html -i

So, what I'm doing here is erasing code coverage data which has been previously stored in this directory, then running the splay test under the coverage framework, which is basically going to run an instrumented version of the code--we'll talk about how this all works a little bit later and what it means--and then we're going to generate some HTML.

So, what you can see is that it ran quite a bit slower this time, because instrumenting the code to measure coverage adds a lot of instructions to the code, and this takes extra time.

Okay, now we're going to look at the output.

covressum.png

Here we are looking at the output of the code coverage tool, and what it's telling us is that when we run the splay tree on its own unit tests, out of the 98 statements in the file, 89 of them got run, 9 of them failed to run, and we'll talk about this "excluded" business later.

So, the lines that didn't get run are marked in red.

And so this first one we're going to skip over for now and let's look at the second example.

covresfailinsert.png

So, the second line that didn't get run is in the splay tree's insert function, and what we can see is that the test suite failed to test the case where we insert an element into the tree and it is already there. If you look at this, it looks fairly harmless--we just return without doing anything--so we're not going to worry about this one right now.

So, let's move on a little bit.

covresfailremove.png

So, here we've got something a little bit more interesting. We're in the splay tree's remove function, which removes an element from the tree. What you can see is that the first thing this function does is splay the tree based on the key to be removed, which is intended to draw that key up to the root node of the tree. If the root node of the tree does not have the key that we're looking for, then we're going to raise an exception saying that this key wasn't found.

But notice that this wasn't tested. If we look below and go on to the body of the delete logic, we see a pretty significant chunk of code that wasn't tested, so we probably want to go back and revisit this.

covresfaildeletebody.png

And so, let's take a step back and think about what the coverage tool is telling us here.

It's basically showing us what we didn't think to test with the unit tests we've written so far. So, what we're going to do is go ahead and fix the unit test a little bit in order to test this case--the case of removing an element from the tree when it's not there. Let's go ahead and do that.

9. Improving coverage

We're going to go to the test suite, specifically to testRemove.

    def testRemove(self):
        for key in self.keys:
            self.t.remove(key)
            self.assertEquals(self.t.find(key), None)

After we've removed everything from the tree, we're going to remove an element that we know is not in the tree, and of course, after we've removed everything from the tree, anything we choose should not be in the tree, so -999 will work as well as any other value. Let's go ahead and save that and run the coverage tool again.

    def testRemove(self):
        for key in self.keys:
            self.t.remove(key)
            self.assertEquals(self.t.find(key), None)
        self.t.remove(-999)

So this time something interesting happens. The remove method for this splay tree, on removal of -999 (the line we just added), causes an exception to be thrown in the splay function:

Traceback (most recent call last):
  File "splay_test.py", line 19, in testRemove
    self.t.remove(-999)
  File "/Users/udacity/Public/cs258/files/splay.py", line 36, in remove
    self.splay(key)
  File "/Users/udacity/Public/cs258/files/splay.py", line 83, in splay
    if key < t.key:
AttributeError: 'NoneType' object has no attribute 'key'

and so let's go back and look at the splay tree code.

So when we remove an element from the tree that isn't there, it's supposed to raise the exception 'key not found in tree'.

    def remove(self, key):
        self.splay(key)
        if key != self.root.key:
            raise 'key not found in tree'
    ...

On the other hand, what it's actually doing is failing quite a bit below that, in the middle of the splay function, when the code does a comparison against a value of type None:

    def splay(self, key):
        l = r = self.header
        t = self.root
        self.header.left = self.header.right = None
        while True:
            if key < t.key:
    ...

and so that's probably not what the developer intended. By adding just a little bit to our test suite, we seem to have found a bug not anticipated by the developer of the splay tree, and I think this example is illustrative for a couple of reasons.

  • First of all, the coverage tool, the very first time we ran it, told us something that we didn't know. In my experience, this is generally what happens when you use a coverage tool. It's very similar to the first time you run a profiling tool on a piece of code, where it usually turns out that the functions using up CPU time are not the ones you thought were using up CPU time. Coverage tools are very similar: it often turns out that the stuff you thought was going to run isn't running, or only some of it is.

So it told us something interesting and that's nice. 

Now, on the other hand, if the coverage tool hadn't told us anything interesting--that is to say, if it had told us that everything we hoped was executing when we ran the unit tests was in fact executing--well, then that's good too: we get to sleep a little easier.

  • The second thing to notice is a much more subtle point: we wrote a test case to execute this one line of code, but it turned out that the bug wasn't there. The bug was somewhere completely different, buried in the splay routine, and if you go back and look at the coverage information, it turns out that the splay routine is entirely covered--that is to say, every line of the splay routine was executed during the execution of the unit tests for the splay tree.

This serves to make a couple of points

  • First of all, just because some code was covered, especially at the statement level, this does not tell us anything about whether it contains a bug. It just means that it ran at least once.
  • The second thing is, we have to ask the question, "What do we really want to read into the fact that we failed to cover something?"

The thing not to read into it is that a failed piece of coverage is a mandate to write a test that covers that particular code. That's what we did here, but that's not the good general lesson. Rather, the way we should think about this is that the coverage tool is giving us a bit of evidence--an example suggesting that our test suite is poorly thought out. That is to say, our test suite is failing to exercise functionality that is present in our code, which means we haven't thought about the problem very well and we need to rethink the test suite.

So to summarize: when coverage fails, it's better to think about why we went wrong rather than just blindly writing a test case that exercises the code which wasn't covered.
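That said, for reference, here is a hedged sketch of how the developer might have guarded the remove method so that a missing key raises the intended exception instead of crashing inside splay. This is my guess at the intent, not the author's code, and it would replace the remove method inside the SplayTree class.

    def remove(self, key):
        # An empty tree certainly doesn't contain the key, and splay()
        # would crash dereferencing a None root, so check up front.
        if self.root is None:
            raise KeyError('key not found in tree')
        self.splay(key)
        if key != self.root.key:
            # Use a real exception class rather than raising a string.
            raise KeyError('key not found in tree')
        # Now delete the root, exactly as in the original code.
        if self.root.left is None:
            self.root = self.root.right
        else:
            x = self.root.right
            self.root = self.root.left
            self.splay(key)
            self.root.right = x

With a fix like this, the testRemove addition above would then expect the exception, for example with self.assertRaises(KeyError, self.t.remove, -999), rather than calling remove(-999) bare.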

10. Problems with coverage

We just looked at an example where measuring coverage was useful in finding a bug in a piece of code.

What I want to show you now is another piece of code--this will be a really quick example--where coverage is not particularly useful in spotting the bug. What I have here is a broken function whose job is to determine whether a number is prime. That is to say, it doesn't correctly do its job of determining whether the number is prime.

import math

# broken
def isPrime(number):
    if number<=1 or (number%2)==0:
        return False
    for check in range(3,int(math.sqrt(number))):  
        if number%check == 0:  
            return False
    return True

So, let's look at the logic in this function. isPrime takes a number. If the number is less than or equal to 1, or divisible by 2, then it's not prime. Once the number passes that test, we loop over every number between 3 and the square root of the number and check whether the original input number divides evenly by that number. If it does, then of course we don't have a prime number. And if all of those tests fail to find a divisor, then we have a prime number.

There is a little bit of test code here that checks how this function responds for the numbers 1 through 5 and 20 through 24.

def check (n):
    print "isprime(" + str(n) + ") = " + str (isPrime(n))

check(1)
check(2)
check(3)
check(4)
check(5)
check(20)
check(21)
check(22)
check(23)
check(24)

And so, what to do is let's just run this code.

isprime(1) = False
isprime(2) = False
isprime(3) = True
isprime(4) = False
isprime(5) = True
isprime(20) = False
isprime(21) = False
isprime(22) = False
isprime(23) = True
isprime(24) = False

This code has correctly computed that 1 is not prime, 2 is not prime, 3 is prime, 4 is not, but 5 is.

Similarly, 20, 21, and 22 are not prime numbers. They all have divisors, 23 is prime, and 24 is not.

So, for these 10 examples, this prime function has successfully identified whether the input is prime or not--so, let's run the code coverage tool. We get the same output. Now, let's look at the results.

covprime.png

So, we can see that out of the 20 statements in the file, all of them ran, and none of them failed to be covered. Statement coverage gives us a perfect result for this particular code, and yet the code is wrong.

Here is the same code in the IDE, so let's run it.

The output is what we expect, and what I would like you to do is do 2 things.

Q. First of all, find at least one test case where you check a value for primality and the code here returns the wrong result.

And then second, create a new function, isPrime2, that fixes the error--that is to say, one that correctly identifies, for any natural number input (any input 1 or larger), whether it's prime.

S. So, here we are back with our broken primality test.

import math

# broken
def isPrime(number):
    if number<=1 or (number%2)==0:
        return False
    for check in range(3,int(math.sqrt(number))):  
        if number%check == 0:  
            return False
    return True

check(6)
check(7)
check(8)
check(9)
check(10)
check(25)
check(26)
check(27)
check(28)
check(29)

And what I have done is change the inputs a little bit, so let's run these new inputs. What we can see is that the function has made a couple of mistakes. For example, it has indicated that 9 is prime, and it has indicated that 25 is prime. Of course, neither of those numbers is prime.

So, let's go back to the code and look at the mistake. The algorithm here is generally a valid one: it's fine to first special-case the even numbers, and it's fine then to loop between 3 and the square root of the number. The problem is that Python's range function does not loop from 3 through the square root of the input number; rather, it loops from 3 up to 1 less than the square root of the input number.

So, we can fix this easily by adding 1 to the square root, causing it to actually test against the full range that it needs to test in order to have a correct primality test.

    for check in range(3,int(math.sqrt(number))+1):

So, let us run this code and see what happens.

So, 6 is not prime, 7 is. None of the rest of these numbers are prime except for 29.

So, the question is: what happened here? Why did test coverage fail to identify the bug? The answer is fairly simple--statement coverage is a rather crude metric that only checks whether each statement executes at least once, and that lets a lot of bugs slip through. For example, a statement can execute but compute the wrong numerical answer, or, as happened here, a loop can execute but run the wrong number of times, computing the wrong result.

The lesson here is that we should not let complete coverage plus a number of successful test cases fool us into thinking that a piece of code is right. It's often the case that deeper analysis is necessary.
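For reference, here is a hedged sketch of what the corrected isPrime2 from the quiz could look like. Besides the range fix shown above, it also treats 2 as prime, which the quiz's requirement of being correct for every natural number implies, even though the lecture's fix only touched the loop bound.

import math

def isPrime2(number):
    # Numbers below 2 are not prime; 2 is the only even prime.
    if number <= 1:
        return False
    if number == 2:
        return True
    if number % 2 == 0:
        return False
    # Check odd divisors up to and including the square root.
    for check in range(3, int(math.sqrt(number)) + 1, 2):
        if number % check == 0:
            return False
    return True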

11. Coverage metrics

Now, we're finally ready to take a detailed look at several coverage metrics, and the first thing to keep in mind is that there is a large number of test coverage metrics out there. If you read the linked article, it lists 101 test coverage metrics. At first, that sounds a little bit like a joke, but if you take a look at the list, you see that it's not too hard to imagine situations in which most of these metrics are actually useful, and that these things would probably find bugs in real software systems.

So, what I'm going to do here is talk about a fairly small number of coverage metrics that matter for everyday programming life, and I'm going to talk about a few more that are interesting for one reason or another--because they give insight into software testing--even though you probably won't actually use them in practice.

Statement coverage

So, the first metric we're going to talk about is called statement coverage, and this is in fact the one that is measured by default by the Python test coverage tool we looked at. So, you already have a pretty good idea of what it does.

So let’s just go through it in a little bit more detail.

So, let's just use this very simple four-line code fragment as an example

if x == 0:
    y += 1
if y == 0:
    x += 1

and let's try to measure its statement coverage. Let's say we call this code with x=0 and y=-1. Well, in that case, the x == 0 test is going to pass, and so y is going to be incremented by 1, making y=0. Now the y == 0 test will also pass and increment x.

So if we enter this code fragment with x=0 and y=-1, all four statements will be executed. So, this will give us a statement coverage of 100%.

So, that’s pretty obvious. It’s really quick, so let’s sort of call it with different values.

So now, say we call this code with x=20 and y=20. Both tests will fail, so we'll end up executing both of the tests but neither of the bodies, and we will end up with statement coverage of 2/4 statements, or 50%.

While we’re on the subject of statement coverage, I would like to also mention line coverage.

Line coverage

This is very similar to statement coverage, but the metric is tied to actual physical lines in the source code. In this case, there is only one statement per line, so statement coverage and line coverage would be exactly identical. On the other hand, if we wrote some code that had multiple statements per line, line coverage would conflate them, whereas statement coverage would consider them as individual statements. For most practical purposes, these are very similar.

So, statement coverage has a slightly finer granularity.
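As a tiny illustrative example (hypothetical, not from the lecture), consider a single physical line that holds two statements:

# One physical line, but two statements: line coverage sees a single
# covered line here, while statement coverage tracks each statement.
x = 1; y = 2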

12. Testing coverage

On the subject of statement coverage, we will take a very short quiz. The quiz is going to have two parts, both of which stem from a little statistics module that we have here.

def stats(lst):
    min = None
    max = None
    freq = {}
    for i in lst:
        if min is None or i < min:
            min = i
        if max is None or i > max:
            max = i
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1
    lst_sorted = sorted(lst)
    if len(lst_sorted)%2 == 0:
        middle = len(lst_sorted)/2
        median = (lst_sorted[middle] + lst_sorted[middle-1]) / 2
    else:
        median = lst_sorted[len(lst_sorted)/2]
    mode_times = None
    for i in freq.values():
        if mode_times is None or i > mode_times:
            mode_times = i
    mode = []
    for (num, count) in freq.items():
        if count == mode_times:
            mode.append (num)
    print "list = " + str(lst)
    print "min = " + str(min)
    print "max = " + str(max)
    print "median = " + str(median)
    print "mode(s) = " + str(mode)

The function stats is defined to take a list of inputs, and what it does is compute the smallest element in the list, the largest element in the list, and the median element of the list--that is to say, if we order the list numerically, the middle element, or the average of the two middle elements if the list has an even number of elements. Finally, it computes the mode of the list, where the mode is the element which occurs most frequently in the list.

In the case where the list is bimodal--meaning it has two modes--or multimodal--more than two modes--we'll print all of them.

What we do here is test the stats function extremely badly: I'm going to create a list containing only the number 31, and I'm going to call stats on that list, so let's look at the coverage that we get.

l =[31]
stats(l)

We can see here that even my fairly bad test managed to cover 29 statements in the stats function, but the statements shown right here remain uncovered.

Q. Your assignment for the first part of this programming quiz is to create a collection of test cases which achieves 100% statement coverage. What I mean by that, specifically, is that your job is to construct several lists, call stats on them, and cover all statements.

What I'd like you to do is think about it a little bit, try to visualize the effect that your inputs are having on the code, and try to come up with a collection of inputs that gets full statement coverage on the first try--but then, of course, check your answer using the coverage tool.

S. And the key really was simply calling the stats function with two lists: one with an even number of elements, the other with an odd number of elements. Furthermore, at least one of the lists had to not be the degenerate list containing only one element.

So if you do something like what I did here, you'll achieve full statement coverage, and let's just look at that quickly.

def test():
    ###Your code here.
    # Change l to something that manages full coverage. You may 
    # need to call stats twice with different input in order 
    # to achieve full coverage.
    l = [31,32,33,33,34]
    stats(l)
    l = [31,32,33,33]
    stats(l)

So here we can see that out of the 34 statements in the program, 34 were run and none were missing--so we achieved full statement coverage.

13. Fooling coverage

Well, let's take another quick quiz. This one, like the broken prime number function I showed you, is designed to probe the limitations of coverage metrics.

Q. So your job now is to insert a bug into the stats module, that is to say make it wrong but in a way that's undetectable by test cases that get full statement coverage.

So test1 is a function that you write which contains either the test cases you just submitted or different ones. These test cases together need to achieve 100% statement coverage for the stats function but must not reveal the bug--that is to say, your broken stats function needs to return the correct answer for the test cases presented here.

The second thing I want you to do is define a function test2, which also calls the stats function, and this one should reveal the bug.

Your assignment is to break the stats function in such a way that the flaw is undetectable by test cases that you design to get 100% statement coverage, but you also need to show us what the flaw is by supplying a second test case.

S. Well I hope you enjoyed trying to fool the test coverage metrics.

It turns out there's essentially an infinite number of ways to do that. The way I chose was to insert a call to the absolute value function,

    for i in lst:
        if min is None or i < min:
            min = i
        if max is None or i > max:
            max = i
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1

So here we're iterating over the values in the list of numbers, looking for the largest one and the smallest one, and computing the frequencies that lead to the mode.

What I did here is set i equal to the absolute value of i.

    for i in lst:
        i = abs(i)
        if min is None or i < min:
            min = i
        if max is None or i > max:
            max = i
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1

Now, of course, when the inputs are positive, as they are in the first two test cases that achieved 100% statement coverage, everything is fine. On the other hand, in the other test case, I call stats with a list containing negative values, and of course it's going to compute the wrong thing for them. The list contains -33 and -34, and clearly the minimum value in that list is not 33 and the max is not 34.

So what you can see here is I've met the requirements of the assignment.

As I said, really there are a lot of different ways you could've accomplished this effect and realistically, for purposes of this assignment which was to get you to think about the limitations of coverage a little bit, any of them would be just as good as the other.

14. Branch coverage

Back to coverage metrics.

We just talked about statement coverage, which is closely related to line coverage, but it's a bit more fine-grained, and now let's talk about what is probably the only other test coverage metric that will matter in your day-to-day life unless you go build avionics software.

Branch coverage is a metric where a branch in the code is covered if it executes both ways. For example, to get 100% branch coverage for a statement that tests whether x == 0, it would need to be executed in a state where x was zero and also in a state where x was not equal to zero.

And so, in many cases, branch coverage and statement coverage have the same effect. For example, if our code only contained if-then-else statements, the metrics would be equivalent. On the other hand, for code like this that's missing the else branches, they're not quite equivalent.

if x == 0:
    y += 1
if y == 0:
    x += 1

We can take as inputs to this code x = 0 and y = -1. These inputs were sufficient to get 100% statement coverage; on the other hand, they are not sufficient to get 100% branch coverage.

What happens is that these inputs cause the taken side of the first if to be executed but not the not-taken side, and then the taken side of the second if to be executed but not its not-taken side. There are different ways to score branch coverage, but one way we can do it is to note that each branch can go two ways, giving four outcomes in total, of which we covered two--so we could say this is 50% branch coverage.

The other thing we could do, however, is do what our Python module for coverage is going to do, we could say that both of these branches were partially executed. That is to say, one of their possibilities was realized during testing. No branches were completely missed, and no branches were totally covered.

So, let's go ahead and see how our coverage module tells us this.

def foo(x,y):
    if x == 0:
        y += 1
    if y == 0:
        x += 1

foo(0, -1)

And so we're going to invoke the function foo with x being 0 and y being -1. Let's see what happens.

We're going to run this under the coverage tool, but this time we're going to give the coverage run command the --branch argument, which simply tells it to measure branch coverage rather than just statement coverage. It's again going to render some HTML as a result, and if we look at the output, it's going to tell us that out of six statements, all six of them were run, and there were zero missing statements.

So, at the statement level, we've achieved 100% coverage. On the other hand, at the branch level, we had two branches that were partially executed, that is to say, only one of their two possibilities was realized during execution.

Now, if we change this a little bit by calling foo a second time with 0 and -2 and run the coverage tool again, what we'll see is that the second branch, the test of y, is now executed both ways. On the other hand, the first branch, the one that tests x, is still only partially executed.
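Concretely, here is a sketch of that pair of calls (foo is the same function shown above); the second call is what drives the y test down its other side.

def foo(x, y):
    if x == 0:
        y += 1
    if y == 0:
        x += 1

foo(0, -1)   # first if: taken; second if: taken (y becomes 0)
foo(0, -2)   # first if: taken again; second if: NOT taken (y becomes -1)

Running this under coverage run --branch as before would then report the y branch as fully covered and the x branch as still partial.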

15. 8 Bit adder

Now we're going to take a quick programming quiz on branch coverage. The quiz is going to involve some Python code that simulates some adders. Adders are very simple hardware modules that perform addition.

Here I'm just going to draw a 1-bit adder to show you how it works.

The code that you're going to test is going to be a cascading series of 8 of these. The way the adder works is it takes two inputs, A and B, and A and B are bits--so they're valued at either 0 or 1, and in Python we're going to represent the 1 bit as True and the 0 bit as False. So, we're going to use Boolean logic to implement this, and there's a carry bit coming in.

The output is a single-bit sum of A and B, plus a carry bit out. The sum function implemented by the adder can be described like this: [S = A \oplus B \oplus C_{in}]. The sum is the A input XORed with the B input XORed with the carry-in. To do an XOR on two booleans in Python, we can simply use the not-equals operator !=.

The carry-out bit is going to be [C_{out} = (A \cdot B) + (C_{in} \cdot (A \oplus B))], where \cdot is logical "and", + is logical "or", and \oplus is XOR; in Python, the "and" and "or" here are, of course, the logical and and or operators.

What we have here is a couple of boolean equations that together implement a full 1-bit adder. If we chain these adders together, we can build up something extremely exciting like an actual adder that corresponds to the add instruction that you would find in the instruction set of a real computer.
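As a hedged sketch (not part of the quiz code), the two equations translate directly into a single one-bit full adder in Python, using != for XOR as described:

def full_adder(a, b, c_in):
    # S = A xor B xor Cin;  Cout = (A and B) or (Cin and (A xor B))
    s = (a != b) != c_in
    c_out = (a and b) or (c_in and (a != b))
    return s, c_out

For example, full_adder(True, True, False) gives (False, True): 1 + 1 is 10 in binary, so the sum bit is 0 and the carry-out is 1.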

Now let's look at the code. Here I have some Python code implementing an 8-bit adder.

def add8(a0,a1,a2,a3,a4,a5,a6,a7,b0,b1,b2,b3,b4,b5,b6,b7,c0):
    s1 = False
    if (a0 != b0) != c0:
        s1 = True
    c1 = False
    if (a0 and b0) != (c0 and (a0 != b0)):
        c1 = True
    s2 = False
    if (a1 != b1) != c1:
        s2 = True
    c2 = False
    if (a1 and b1) != (c1 and (a1 != b1)):
        c2 = True
    s3 = False
    if (a2 != b2) != c2:
        s3 = True
    c3 = False
    if (a2 and b2) != (c2 and (a2 != b2)):
        c3 = True
    s4 = False
    if (a3 != b3) != c3:
        s4 = True
    c4 = False
    if (a3 and b3) != (c3 and (a3 != b3)):
        c4 = True
    s5 = False
    if (a4 != b4) != c4:
        s5 = True
    c5 = False
    if (a4 and b4) != (c4 and (a4 != b4)):
        c5 = True
    s6 = False
    if (a5 != b5) != c5:
        s6 = True
    c6 = False
    if (a5 and b5) != (c5 and (a5 != b5)):
        c6 = True
    s7 = False
    if (a6 != b6) != c6:
        s7 = True
    c7 = False
    if (a6 and b6) != (c6 and (a6 != b6)):
        c7 = True
    s8 = False
    if (a7 != b7) != c7:
        s8 = True
    c8 = False
    if (a7 and b7) != (c7 and (a7 != b7)):
        c8 = True
    return (s1,s2,s3,s4,s5,s6,s7,s8,c8)

What you can see is that it takes a0 through a7--that is to say, 8 bits of input that constitute the bits of a, where a0 is the low-order bit and a7 is the highest. Then b0 through b7 are the bits of the second input, where again b0 is the lowest-order bit of b and b7 is the highest. It also takes in an initial carry-in bit.

Q. The chain of logic here is a cascading series of full adders for the individual bits. As you can see, it's a little bit long. Your problem for this programming quiz is to come up with a series of calls to this 8-bit add function which get 100% branch coverage.

And let me give you a couple hints.

  • First of all, the way to approach this is probably to think about how you would test an actual adder. Most likely, if we're testing an adder--say this code is part of some sort of circuit simulation, and we want to test the validity of our circuit--we'd pass it actual numbers. How do you pass it actual numbers? Well, you can write some little support functions for yourself that take integers and convert them into the bit format. That's probably how I would test this, though it's certainly not the only way.
  • But what you probably don't want to do is treat this as a black box. What I'm deliberately doing here is giving you a function that's going to be difficult to test if you don't understand the logic at all. What you need to do is think of this as an adder, and then it becomes fairly easy to test.

S. Let's go through a couple of solutions to this programming quiz where we're trying to get 100% branch coverage for an 8-bit adder.

And so the way I solved this is using exhaustive testing. The insight is that 8-bit inputs cover the full range of values from 0 to 255. What I can do is write a function testExhaustive which lets i loop over the range 0-255 and j loop over the range 0-255, and then define a myadd function which is going to invoke the 8-bit adder with those inputs. Then what we're going to do is print the output and assert that the result is equal to the actual i + j.

def testExhaustive():
    for i in range(256):
        for j in range(256):
            res = myadd(i,j)
            print str(i) + " " + str(j) + " = " + str(res)
            assert res == (i+j)

This goes a little bit beyond what I asked you to do for the quiz, but it's a good idea to make sure that the code is right.

So, now this myadd function is kind of where the magic happens.

def myadd(a, b):
    (a0,a1,a2,a3,a4,a5,a6,a7) = split(a)
    (b0,b1,b2,b3,b4,b5,b6,b7) = split(b)
    (s0,s1,s2,s3,s4,s5,s6,s7,c) = add8(a0,a1,a2,a3,a4,a5,a6,a7,b0,b1,b2,b3,b4,b5,b6,b7,False)
    return glue(s0,s1,s2,s3,s4,s5,s6,s7,c)

What myadd does is take an integer a and split it into 8 bits that represent the same value as the integer, and do the same thing for b. Then we call the add8 function with the bits of a and the bits of b, resulting in a set of bits representing the sum. Then we can glue those back into an integer, and that's what gets returned to check our assertion.

Let's look at the split and glue functions.

Split is pretty simple. It just takes an integer n, and if n bitwise-anded with 1 is nonzero, then the low-order bit must have been set, so we return a truthy value for the low-order bit position. Then we do the same thing with the masks 2, 4, 8, 16, 32, 64, and 128.

def split(n):
    return(n&0x1, n&0x2, n&0x4, n&0x8, n&0x10, n&0x20, n&0x40, n&0x80)

Together these tests serve to split the input integer into a sequence of truthy and falsy values that together represent that integer.
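For example (an illustrative call, not in the original code), split(5) returns the masked values rather than literal booleans:

print split(5)   # prints (1, 0, 4, 0, 0, 0, 0, 0): binary 00000101.
                 # The nonzero entries are truthy, and since add8 only
                 # uses these values in boolean contexts, truthy
                 # integers work just as well as True.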

The glue function does exactly the opposite thing. It takes a series of boolean values and glues them into an integer. The way we do this is by keeping a running total: if b0--that is to say, the low-order bit of the input--is set, we increment the running total by 1. If the next bit is set, we increment it by 2. If the third bit is set, we increment it by 4, then 8, 16, 32, 64, 128, and 256. Together, this lets us reconstruct an integer value from its set of bits.

def glue(b0, b1, b2, b3, b4, b5, b6, b7, c):
    t=0
    if b0:
        t+=1
    if b1:
        t+=2
    if b2:
        t+=4
    if b3:
        t+=8
    if b4:
        t+=16
    if b5:
        t+=32
    if b6:
        t+=64
    if b7:
        t+=128
    if c:
        t+=256
    return t

So, when we put these pieces together, we can verify that the code actually implements an adder, and incidentally, because we're testing it on all possible inputs, we're going to get full branch coverage. Let's just make sure that that's the case. Okay, so we just run the code. Now let's look at the coverage output.

Here we're looking at the coverage output, and what we can see is that of the 85 statements present in the adder, all 85 of them have run, so none of them are missing, and there were 0 partially executed conditional statements--that is to say, no conditionals that only went one way.

If we look through the code we can see that indeed, the coverage tool believes that it's all covered. So, now let's look at an alternative solution.

Instead of our exhaustive test, we could have written a much smaller test that gets 100% branch coverage. This is based on the observation that if we look at the cascading series of tests in the adder, basically all that matters is whether the input bits are 0 or 1. So if we call the adder with all 0 bits, call it with all 1 bits, and then add one more test case, we can get 100% branch coverage this way:

myadd (0,0)
myadd (0,1)
myadd (0,255)

So let's just make sure that that's, indeed, the case.

I'm going to run the adder again. This time it's not printing out anything. Let's go back to the webpage. This time we have 81 statements, and they all ran, and we have 0 missing statements and 0 partially executed statements. So, as you can see, the coverage tool believes that just these three really simple test cases were sufficient to get 100% branch coverage of our adder.

16. Other metrics

We've looked at a couple of common coverage metrics that come up in practice:

  • statement coverage, which is a close relative of line coverage,
  • and branch coverage, also called decision coverage,

and for most of you out there, these are the coverage metrics that are going to matter for everyday life.

Now, as I said before, there are many other coverage metrics, and we're going to look at just a few of them. The reason these are interesting is not because we're going to go out and obsessively get 100% coverage on our code under all of these metrics, but rather because they form part of the answer to the question: how should we come up with good test inputs in order to effectively find bugs in our software?

Loop coverage

Loop coverage is very easy. It simply specifies that we execute each loop 0 times, once, and more than once. The insight here is that loop boundary conditions are an extremely frequent source of bugs in real codes. For example, if we had this loop in our Python code,

for line in open("file"):
    process(line)

so for each line in open("file")--that is to say, we want to read every line from the file and then process it--to get full loop coverage we would need to test this code using a file that contains no lines, using a file that contains just one line, and using a file that contains multiple lines.
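
For instance, here's a minimal sketch of a test that achieves loop coverage of a loop like this one; process_file and write_temp are hypothetical helpers, not part of the course's code:

import os
import tempfile

def process(line):
    pass  # stand-in for whatever per-line processing the real code does

def process_file(path):
    # the loop under test: runs once per line in the file
    for line in open(path):
        process(line)

def write_temp(contents):
    # helper: write contents to a temporary file and return its path
    fd, path = tempfile.mkstemp()
    os.close(fd)
    f = open(path, "w")
    f.write(contents)
    f.close()
    return path

process_file(write_temp(""))                    # loop body executes 0 times
process_file(write_temp("one line\n"))          # loop body executes exactly once
process_file(write_temp("line 1\nline 2\n"))    # loop body executes more than once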

MC/DC

Now, let's look at a fairly heavyweight coverage metric called modified condition/decision coverage, or MC/DC. If that seems like kind of a big mouthful, there's a reason for that, which is that MC/DC coverage is required for certain kinds of avionics software. That is, if you're going to write safety-critical software, where if the software fails airplanes can fall out of the sky, then one of the things you'll need to do to show that your software is correct is get a high degree of MC/DC coverage.

It's designed to be pretty rigorous but still without blowing up into an exponential number of tests. I'm going to simplify here a little bit, but MC/DC coverage basically starts off with branch coverage. It additionally states that every condition involved in a decision takes on every possible outcome. Conditions are simply boolean-valued variables found in tests, and decisions are just the kind of tests that we see in an if statement or a while loop, or something like that. Finally, every condition used in a decision must independently affect its outcome. So, that's kind of a mouthful.

This is going to be hard to grasp unless we go through an example, so let's do that.

17. MC/DC Example

So, we're going to start off with the Python statement

if A or (B and C):

so let's make sure to nail down the precedence so we don't have to remember it: if A is true, or else both B and C are true, then we're going to execute some code.

And so now let's look what it takes to get full MC/DC coverage of this bit of code.

The first thing we can see is that we're going to need to test each of the variables with both their true and false values, because the conditions--that is to say, A, B, and C here--need to take on all possible values. Each of the conditions is going to need to be assigned both the value true and the value false during the tests that we run.

Now, the other part of MC/DC coverage--that is, does every condition independently affect the outcome of a decision--is going to be a little harder to deal with. Let's take a look.

Let's first consider the input where A is true, B is false, and C is true.

#  A        B         C
if True or (False and True):
#  True     False
#  True

We don't even need to look at the B and C part of the expression, because we know that if A is true, then the entire condition succeeds. This maps to a true value. What we want to verify here is that we can come up with test cases where every condition independently affects the outcome of the decision. Since our top-level operator here is an or, let's see how we can make the whole thing come out false.

The B and C clause already came out false, because B is false. So if we make A false, then the entire decision will come out false, and since we've changed only A and haven't changed B and C, we've shown that A independently affects the outcome. So, let's write another test case.

Our second test case with A being input as false, B input as false, and C as true,

#  A         B         C
if False or (False and True):
#  False     False
#  False

leads the overall decision to come out as false. We've shown now that A independently affects the outcome.

So let's try to do the same thing for B and C.

If we want to continue trying to leave everything the same and only change the value of one variable in order to establish this independence condition, let's this time try flipping the value of B. We're going to have A being false, B being true, and C being true.

#  A         B         C
if False or (True and True):
#  False     True
#  True

If we look at the overall value of the decision, what now happens is that B and C evaluates to true, so it doesn't matter that A evaluates to false. The overall decision evaluates to true this time.

By flipping the value of only B we've satisfied this condition for the input B. That is to say we've shown that B independently affects the outcome, because when we change B, the overall value of the decision went from false to true. Now let's see if we can do the same thing for C.

We're going to leave A and B the same, and we're going to pass in C as false. Now let's look what happens.

#  A         B         C
if False or (True and False):
#  False     False
#  False

B and C evaluates to false, and then also A is false, so the entire value of the boolean decision comes out to be false.

By only changing C and by seeing that the overall decision changes value, we've now shown that C independently affects the outcome of the decision.

So, what I believe we have here is a minimal--or if not minimal, at least fairly small--set of test cases that together get 100% MC/DC coverage for this particular conditional statement in Python.
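
To collect those test cases in one place, here's a small sketch; decision is just a stand-in function for the conditional above, and the comments note which condition each case establishes independence for:

def decision(A, B, C):
    return A or (B and C)

# Four test cases achieving MC/DC coverage of "A or (B and C)":
assert decision(True,  False, True)  == True    # baseline: outcome is true
assert decision(False, False, True)  == False   # flip only A -> outcome flips: A matters
assert decision(False, True,  True)  == True    # flip only B -> outcome flips: B matters
assert decision(False, True,  False) == False   # flip only C -> outcome flips: C matters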

You can see here that this isn't a particularly complicated conditional. We could've written one much more complicated, and if we had, we probably would've had a fairly hard time reasoning this stuff out by hand, and what we would've needed to do in that case is probably draw out a full truth table.

So, let's look at the idea behind MC/DC coverage. Why would this be a good thing at all?

What it's done is taken a statement that was really very easy to cover using either branch coverage or statement coverage--that is to say, it's pretty easy to make this come out either true or false overall--and forced testing of the individual components of the boolean logic.

Basically, the idea is that when we have complicated boolean expressions, their truth tables become rather large. What that means is there's a lot of complexity hiding in those truth tables. Where there's complexity, there are probably things we don't understand, and that means there are probably bugs.

It turns out that the domain of interest for MC/DC coverage--that is to say, the embedded control systems found in avionics--generally ends up having lots of complicated conditionals. When people's lives depend on the correctness of these complicated conditional expressions, it's definitely desirable to force people to test them rather thoroughly.

The other idea behind MC/DC coverage is that, as part of establishing that every condition independently affects the outcome of a decision, we're going to discover when we have conditions that don't independently affect the outcome of a decision.

If you think about it, why would you have a condition involved in a conditional expression which doesn't affect the outcome? What that basically means is there's a programming mistake, and it may be a harmless mistake. That is to say, the extra condition being part of a decision may not affect the correctness of the overall system, but it means somebody didn't understand what was going on, and probably there's something that we need to look at more closely and perhaps even change.

Another thing you can see, looking at MC/DC coverage, is that getting it on a very large piece of software is going to take an enormous amount of work. This is why it's a specialized coverage metric for the avionics industry where the critical parts of the codes end up being fairly small.

That's MC/DC coverage, and we're not going to take a programming quiz on that, since first of all, as you saw, it gets to be a pain pretty quickly, and second we lack good tool support for MC/DC coverage in Python.

18. Path coverage

The next coverage metric we want to look at is called path coverage.

Path coverage is a little bit different than previous metrics that we've looked at, because it cares about how you got to a certain piece of code. In general, things like statement coverage and branch coverage, and even to a large extent MC/DC coverage and loop coverage, don't really care how you got somewhere as long as you executed the code in such a way that you met the conditions.

So, path coverage cares how you got somewhere, so let's look at what that means. A path through a program is a sequence of decisions made at the branch points in the program. Okay, so let's look at the function foo, which takes two parameters, x and y, and does something x times and does something else once if y is true.

def foo(x,y):
    for i in range(x):
        something()
    if y:
        somethingElse()

What we're going to try to do is visualize the decisions made by the Python language as it goes through this program.

Let's first of all look at the execution if x is 0 and y is true.

  • Range 0 is the empty list, so we're not going to do anything here.
  • And then, since y is true, we're going to execute something else, and we'll leave.

So, this is one sequence of decisions that we made. We made the decision to execute something 0 times and something else once.

So, now if we come in with x=1 and y=true,

  • we execute something,
  • we execute something else,
  • and we leave.

This is a separate path through the code, because we made a different set of decisions.

Now, if we executed a third time with x coming in as 2,

  • we're going to go execute something,
  • go back, execute it again,
  • execute something else
  • and leave.

This, again, is a distinct path through the code, because we made a different set of decisions when we got to branch points in the code. As x increases in value, we get more and more paths. One thing you might ask is: just by changing x, how many paths can we get through the code? The answer is that it's unlimited.

You can see that achieving path coverage is going to be impossible for essentially all real code. What it does is give us something to think about when we're generating test cases, because every possible path through the code might have a distinct behavior, so we'd like to test lots of paths through the code. We just can't test them all.

Now, there's going to be a similar family of paths for y is false. We're going to get a path where x is 0, so:

  • we're going to skip doing something.
  • And we're also not going to execute something else.

We essentially have 2 times infinity paths through this code.

So, path coverage is basically an ideal that we'd like to approach if we want to do a good job testing. It's not going to be something that we can actually achieve.

Now, let's take a really quick quiz on path coverage. I'm going to write a function here called foo again, and it's going to take three boolean parameters--x, y, and z. What the code is going to do is it's going to use each of them as a conditional in a test. So, if x is true, we do something. Otherwise, we do nothing here. If y is true, we do something. If z is true, we do something.

Q. The question I pose to you is how many paths through this code are there?

def foo(x, y, z):
    if x:
        ...
    if y:
        ...
    if z:
        ...

S. The answer to the quiz is 8, and let me show you why that's the case.

We came into the program, and the paths fork based on the value of x.

So, now we have two paths. Assuming that this code doesn't contain any branches that we care about, we now come to another decision point--if y is true, we fork execution, and we do that on both of the existing paths.

Then we come down here, test on z, and fork again.

This illustrates why, in addition to loops, conditionals make path coverage hard: we get a number of paths that is exponential in the number of conditional tests. Of course, achieving full path coverage is going to be completely impractical, but on the other hand, it could be that a bug lurks on only one of these paths, and we'd need path coverage to find that bug. So this is a valuable weapon to have in our testing arsenal, even if we're not going to achieve path coverage in practice.
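
As a sanity check, here's a quick sketch that counts the paths by running foo (with the elided bodies replaced by path markers, which is an assumption on my part) on every combination of the three boolean inputs:

from itertools import product

def foo(x, y, z):
    path = []          # record which branches this execution took
    if x:
        path.append("x")
    if y:
        path.append("y")
    if z:
        path.append("z")
    return tuple(path)

# every distinct combination of branch decisions is a distinct path
paths = set(foo(x, y, z) for (x, y, z) in product([True, False], repeat=3))
assert len(paths) == 8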

19. Boundary value coverage

So the next kind of coverage I want to talk about is boundary value coverage.

Boundary value coverage, unlike some of the other coverage metrics we've looked at, doesn't have any single accepted technical definition. It can be used to mean different things, so we're going to look at it in a broad sense.

What boundary value coverage basically says is when a program depends on some numerical range, and when the program has different behaviors based on numbers within that range, then we should test numbers close to the boundary.

So let's take the example where we're writing a program to determine whether somebody who lives in the USA is permitted to drink alcohol.

What we want is that at 21 years of age or more it is legal for them to drink alcohol, and if they are less than 21--that is to say, 20 or less--it is not. To get boundary value coverage on this program, we want to include the ages of 20 and 21 in our test inputs, and possibly also 19 and 22, which are close enough to the boundary that there may be interesting behaviors worth looking at as well.
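
As a sketch, assuming a hypothetical can_drink function that implements this rule, the boundary value test inputs would look something like this:

def can_drink(age):
    # legal to drink alcohol in the USA at 21 or older
    return age >= 21

# test values at and near the age-21 boundary
assert can_drink(21) == True     # exactly at the boundary
assert can_drink(20) == False    # just below the boundary
assert can_drink(22) == True     # one above, in case of an off-by-one the other way
assert can_drink(19) == False    # one further below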

The insight here is that one of the most common kinds of errors that we make while creating software is the off-by-one error. Off-by-one errors can almost always be triggered by a value at the boundary, and that's what we're trying to do here.

What I've done here is frame boundary value coverage in terms of only one variable, and I've also framed it in terms of the program's specification, not in terms of its implementation.

So let's look at those two issues in turn. Consider a program with two inputs, and assume for the sake of argument that these inputs are treated independently by the software. The first input is going to be the age of your car--we're an insurance company here--and we're going to decline to insure cars more than 20 years old. The other input is the age of the driver, and here we're going to decline to insure drivers who are less than 18 years old.

If the software treats these values independently we will probably be okay testing values around 18 independently of the age of the car and testing car ages around 20 years old independently of the age of the driver.

Now, on the other hand, if we had specific knowledge that our implementation considers these variables together, then we would probably also need to test combinations of inputs at both boundaries, and you can see the problem here: as the number of inputs to the program goes up, the number of test cases can grow very large, because we have to consider the interactions between all possible combinations of variables that are dependent. On the other hand, if our variables are independent, then we can test them separately.
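
Here's a sketch of what testing the boundary combinations might look like, assuming a hypothetical will_insure(car_age, driver_age) function that implements the policy just described:

def will_insure(car_age, driver_age):
    # decline cars more than 20 years old or drivers younger than 18
    return car_age <= 20 and driver_age >= 18

# If the implementation may treat the variables together, test combinations
# of boundary values rather than each boundary in isolation.
assert will_insure(20, 18) == True    # both at their boundaries
assert will_insure(21, 18) == False   # car just over the limit
assert will_insure(20, 17) == False   # driver just under the limit
assert will_insure(21, 17) == False   # both just on the wrong side
assert will_insure(19, 19) == True    # both comfortably inside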

So that was a brief visit to the issue of whether we are doing boundary value coverage with respect to the specification--that is, the purpose of the software--or whether we are doing it with respect to the implementation.

Let's go and revisit the program we had a little bit earlier, where I inserted a bug into our stats function which causes it to misbehave for some inputs and not for others.

def stats(lst):
    min = None
    max = None
    freq = {}
    for i in lst:
        if min is None or i < min:
            min = i
        if max is None or i > max:
            max = i
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1
    lst_sorted = sorted(lst)
    if len(lst_sorted)%2 == 0:
        middle = len(lst_sorted)/2
        median = (lst_sorted[middle] + lst_sorted[middle-1]) / 2
    else:
        median = lst_sorted[len(lst_sorted)/2]
    mode_times = None
    for i in freq.values():
        if mode_times is None or i > mode_times:
            mode_times = i
    mode = []
    for (num, count) in freq.items():
        if count == mode_times:
            mode.append (num)
    print "list = " + str(lst)
    print "min = " + str(min)
    print "max = " + str(max)
    print "median = " + str(median)
    print "mode(s) = " + str(mode)

So if you recall, we have the function stats, which computes the minimum, the maximum, the median, and the mode of a list of numbers, and the problem I posed to you in a quiz was to break this function in such a way that a collection of test cases getting good coverage on it wouldn't discover the bug.

The way I accomplished this was by taking the absolute value of the inputs, and so that's the bug: it's only going to be discoverable when we pass the function a list containing negative values.

So let's think about what it takes to get boundary value coverage on a function like this--and now we're talking about boundary value coverage with respect to the implementation, not just the specification. A function like absolute value changes its behavior around zero, and so to get boundary value coverage of the absolute value function we need to call it with a negative argument and a positive argument.

So to get good boundary value coverage on this function, we would have been forced to call it with a list containing at least one negative number, and in that case we most likely would have discovered the bug.
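
For example, a single test input straddling zero would do it (a sketch; the buggy version that takes absolute values isn't shown here):

# On the buggy version, abs() would turn -3 into 3, so the reported minimum
# would be wrong; on the correct version shown above we expect:
#   min = -3, max = 2, median = 1, mode(s) = [2]
stats([-3, 0, 1, 2, 2])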

Let's look at this function in its broader context. We have a lot of code here. For example, i < min, i > max

...
    for i in lst:
        if min is None or i < min:
            min = i
        if max is None or i > max:
            max = i
...

We have a lot of different operators in here that will have different behaviors around certain boundaries, and so getting good boundary value coverage on this function overall would probably be extremely difficult.

I don't know of good tools for automating this in Python, but there are techniques, such as mutation testing, that can automate boundary value coverage at least in some general form.

20. Concurrent software

Now, let's talk about a slightly different issue, which is, "What should we do about test coverage for concurrent software?"

Now, in general in this class, we haven't been dealing at all with testing concurrent code, mainly because it is a difficult and specialized skill. But let's talk briefly about what coverage of concurrent software would actually mean. First of all, hopefully it's clear that while applying sequential code coverage metrics to concurrent software is a fine idea, these metrics aren't going to give us any confidence that the code lacks concurrency errors such as race conditions and deadlocks.

So let's talk about how we would figure out if we've done a good job testing concurrent software. Take, for example, this function xfer, which transfers some amount of money between bank account one and bank account two. This particular function is designed to be called from different threads.

So what I've done here is mark a1 and a2 in red. These variables represent the different bank accounts, and they're marked in red to indicate that they are shared between different calls to xfer. The xfer function is going to be called from one thread--so some thread is going to transfer money between accounts--and several other threads are going to do the same thing.

So what we have is multiple threads calling this transfer function and as long as the threads are moving money between different accounts, probably everything is all right. 

On the other hand, the xfer function does not synchronize--that is, it doesn't take any sort of lock while it manipulates the accounts. If these threads operate on the same accounts concurrently, then we have a problem: we're going to mess up the balances of the accounts involved.
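
The lecture doesn't show the code for xfer, but a minimal sketch of the unsynchronized version being described might look like this (the dictionary-based accounts and the thread setup are my assumptions):

import threading

def xfer(a1, a2, amount):
    # no lock is taken: the read-modify-write updates below can interleave
    # with another thread touching the same accounts
    a1['balance'] -= amount
    a2['balance'] += amount

checking = {'balance': 100}
savings = {'balance': 100}

# two threads operating on the same pair of accounts at the same time --
# exactly the situation an interleaving coverage metric asks us to exercise
t1 = threading.Thread(target=lambda: [xfer(checking, savings, 1) for _ in range(100000)])
t2 = threading.Thread(target=lambda: [xfer(savings, checking, 1) for _ in range(100000)])
t1.start(); t2.start()
t1.join(); t2.join()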

And so I ask the question: what sort of coverage would we be looking for while testing this function in order to detect this kind of bug?

The answer is some sort of coverage metric that makes sure that two threads, T1 and T2, both call this function at the same time while transferring money between the same accounts. The coverage would essentially be: T1 gets partway into the function and then, let's say, stops running; the processor then runs T2, which operates on the accounts and completes. This interleaving of actions between the different threads is what would constitute a unit of test coverage--that is to say, we want to make sure we tested the case where xfer is called concurrently on the same accounts.

So that's one kind of coverage that we might look for when testing concurrent software.

21. Synchronization coverage

Another kind of coverage that we might look for is inspired by what happens after we fix this xfer function. The likely fix for the bug is to lock both of the bank accounts that we're processing, transfer the balance between them, and then unlock the accounts. What this does is make sure that no other threads in the system are messing with the two accounts we're updating while we're in the middle of updating them.
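
Again as a sketch (not the course's code), the fixed version might give each account its own lock and acquire both before touching the balances; acquiring them in a fixed order avoids deadlock:

import threading

def make_account(balance):
    return {'balance': balance, 'lock': threading.Lock()}

def xfer(a1, a2, amount):
    # lock both accounts, always in the same order, before updating them
    first, second = (a1, a2) if id(a1) < id(a2) else (a2, a1)
    with first['lock']:
        with second['lock']:
            a1['balance'] -= amount
            a2['balance'] += amount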

Now, one thing that people have found while testing concurrent software is that you can often delete all of the locks from the code, run it through some tests, and it still passes. This is really sort of a scary and depressing thing, because it means that during the test suite the locks weren't doing anything.

This is the inspiration for the test coverage metric called synchronization coverage. What synchronization coverage does is ensure that during testing each lock actually does something. In other words, during testing, the xfer function is going to be called to transfer money between accounts when the accounts are already locked, and this ensures that we're stressing the system to a level where the synchronization code is actually firing--it's actually having some effect on the execution. If that's happening, it means we're probably doing a reasonable job stress testing the system.

So that was really quick, and like I said, we're not going to go into much detail here.

In summary, if we're testing concurrent software, we want to be looking at concurrency-specific coverage metrics: interleaving coverage, which ensures that functions which access shared data are actually called in a truly concurrent fashion--that is, by multiple threads at the same time--and synchronization coverage, which ensures that the locks that we put into our code actually do something.

22. When coverage doesn't work

We've now looked at a number of different coverage metrics.

I'd like to return to a topic that I discussed at the beginning of this lesson, which is the input domain. What coverage lets us do is divide up the input domain into regions in such a way that, for any given region, any test input in that region will accomplish some specific coverage task. Any input that we select within a particular region of the input domain will cause a particular statement or line to execute, cause a branch to be taken in some specific direction, execute a loop zero times, one time, or more, and so on.

And so the obvious problem is this: if we partition the input domain this way and go ahead and test each region we can reach, we're going to get good coverage, but what we're not going to be able to do is find errors of omission. That is to say, nothing about this process of selecting points in the input domain and testing them in order to achieve good coverage is going to let us discover what we haven't implemented.

Let me give a quick example. Typically, as we've talked about in lesson two, the software under test is using APIs provided by the system. Perhaps the system under test is creating files on the hard disc. One very plausible kind of bug for the system under test to contain is failing to check error codes that could be returned from file creation operations--errors that happen when the disc is full or when there is a hard disc failure or something like that.

And so what I really mean here is that, for example, we might get full branch coverage of the system under test, but there just isn't a branch in the code that should have been taken when the hard disc returns an error code. And if the branch that should be there isn't there, then when the disc does return an error code, something weird is going to happen--the software is going to fail.

So, the fundamental fact here is that coverage metrics are not particularly good at discovering errors of omission like missing error checks. To discover those, we need to use other methods. For example, we discussed fault injection, where we make the disc fail--we make it send something bad up to the system--and see what happens. In that case, if we're missing an error check, the system will actually do the wrong thing, and we'll be able to discover this by watching the system misbehave.

Another thing we could have done is partition the input domain in a different way. That is to say, rather than partitioning the input domain using an automated coverage metric, partition it using the specification.

So if we partition the input domain based on the specification, and the specification mentions the need for our system to respond gracefully to disc errors, there is going to be some little corner of the input space that is triggered only when the disc fails, and we will test that.

So the point that I'm getting to here is that there are multiple ways of partitioning the input domain for purposes of testing. Partitioning the input domain based on features of the code is one way, and it happens to be quite effective in general, but since it can't find errors of omission, we also want to sample the input domain in other ways. That's a theme we will return to in full force in lessons four, five, and six.

All right, so that wraps up our quick survey of test coverage metrics.

23. Infeasible code

Now, I'm going to talk about the question: what does it mean when we have code that doesn't get covered? For example, if we're using statement coverage, what happens when we have some statements that we haven't been able to cover?

There are basically 3 possibilities.

  • The first possibility is that the uncovered code is simply infeasible. Infeasible code means exactly what it sounds like: it's code that can't be executed under any circumstances.

Let's say that we're writing the checkRep function for a balanced tree data structure. At some point in checkRep, we're going to assert that the tree is balanced. In the implementation of the balanced method, we're going to check whether the left subtree and right subtree have the same height, and if not, we're going to return false, causing an assertion violation.

Assuming that the code is not buggy, and assuming that we're testing a correct tree, we're never going to be able to return false from the balanced function. And of course, a coverage tool is going to tell us that we failed to achieve coverage for that particular statement in the code. We have to ask ourselves: is that a bad thing? Is it bad that we failed to execute this line of code? Of course not, because that code can only execute if we made a mistake somewhere.
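
Here's a minimal sketch of that situation; the node and tree attributes are hypothetical, not from the course:

def height(node):
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def balanced(node):
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False   # infeasible if the tree code is correct: a correct
                       # tree never gets here, so coverage reports a miss
    return balanced(node.left) and balanced(node.right)

def checkRep(tree):
    # the assertion fires only if balanced() ever returns False
    assert balanced(tree.root)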

So the proper response to this kind of situation is to tell our coverage tool that we believe this line to be infeasible, and then the tool won't count this line against us when we're measuring code coverage. Let's look at an example of this--what we have here is an open source AVL tree.

An AVL tree is somewhat analogous to the splay tree that we already looked at, in the sense that it's a self-balancing binary search tree. As you can see, there's quite a lot of code here; AVL trees have a reputation for being fairly complicated as these things go. The code comes with its own test suite, so let's run it under the coverage monitor.

Okay. Now, let's look at the output.

You can see here that out of the 389 statements in the AVL tree, we failed to run 61 of them, and 20 branches were partially executed. So you can see that there are various failures of the test suite to exercise interesting behaviors.

So, here is the AVL tree's sanity_check function,

def sanity_check (self, *args):
    if len(args) == 0:
        node = self.rootNode
    else:
        node = args[0]
    if (node is None) or (node.is_leaf() and node.parent is None ):
        # trival - no sanity check needed, as either the tree is empty or there is only one node in the tree
        pass
    else:
        if node.height != node.max_children_height() + 1:
            raise Exception ("Invalid height for node " + str(node) + ": " + str(node.height) + " instead of " + str(node.max_children_height() + 1) + "!" )

        balFactor = node.balance()
        #Test the balance factor
        if not (balFactor >= -1 and balFactor <= 1):
            raise Exception ("Balance factor for node " + str(node) + " is " + str(balFactor) + "!")
        #Make sure we have no circular references
        if not (node.leftChild != node):
            raise Exception ("Circular reference for node " + str(node) + ": node.leftChild is node!")
        if not (node.rightChild != node):
            raise Exception ("Circular reference for node " + str(node) + ": node.rightChild is node!")

        if ( node.leftChild ):
            if not (node.leftChild.parent == node):
                raise Exception ("Left child of node " + str(node) + " doesn't know who his father is!")
            if not (node.leftChild.key <= node.key):
                raise Exception ("Key of left child of node " + str(node) + " is greater than key of his parent!")
            self.sanity_check(node.leftChild)

        if ( node.rightChild ):
            if not (node.rightChild.parent == node):
                raise Exception ("Right child of node " + str(node) + " doesn't know who his father is!")
            if not (node.rightChild.key >= node.key):
                raise Exception ("Key of right child of node " + str(node) + " is less than key of his parent!")
            self.sanity_check(node.rightChild)

and what it does is check a bunch of properties of the AVL tree and, if any of them fail, raise an exception. Every statement here that raises an exception has failed to be covered, and the branches that test the conditions leading to those exceptions are only partially covered.

So superficially, we haven't gotten very good coverage of this function, but of course what we hope is that this AVL tree code is correct and these statements are truly infeasible. If we really believe that, we can go ahead and start telling the coverage tool to ignore them, and the way we do that is to go back to the source code and annotate these lines with a comment that has a special form: pragma: no cover.

# pragma: no cover

And then we go ahead and do this for the rest of these lines. All right. Now we run the coverage tool again and see if things look a little bit better. Okay, good. It's still exercising all of this code, but we can see now that the coverage tool has marked these lines in a light gray color, indicating that it's ignoring the fact that they weren't covered, because we told it to.
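
For example, one of the checks in sanity_check might be annotated like this; putting coverage.py's pragma on the if line excludes the whole clause, so neither the partial branch nor the unexecuted raise counts against us:

if not (node.leftChild != node):  # pragma: no cover
    raise Exception ("Circular reference for node " + str(node) + ": node.leftChild is node!")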

24. Code not worth covering

So, case 1: the code that we didn't cover is infeasible code.

The second case is code that we believe to be feasible but which isn't worth covering.

Code might not be worth covering because it's very hard to trigger and very simple. Let me give a quick example. Here, the res variable is the result of the command to format a disc, and if that operation fails, we're just going to abort the program.

res = FormatDisk()
if res == ERROR:
    abort()

It might be the case that we lack an appropriate fault injection tool that would let us easily simulate the failure of a FormatDisk call. Furthermore, the response of the code in this case is simply to abort the execution of the program. If those two things are the case, then we might be perfectly happy not to test this branch. The reason is that the abort code, which terminates the entire application, is presumably something that was tested elsewhere, so we don't have any real suspicion that it might fail, and there's no reason to think that calling it from a location like this would act any differently--so we'll probably be okay.

25. Inadequate test suite

The third reason code might not have been covered is that the test suite might simply be inadequate, and if the test suite is inadequate, we have to decide what to do about it.

One option, of course, is to improve the test suite so that the code gets covered. The other option is not to worry about it. Since achieving good coverage can be difficult, it can easily be the case that we'll decide to ship our software without achieving 100% code coverage, and there is nothing inherently wrong with that, because coverage testing, like all forms of testing, is basically a cost-benefit tradeoff.

The question we have to ask is: are we getting enough value out of the testing to offset the cost of doing it? And if we do ship software with uncovered code, what we basically have to do is admit to ourselves that we're perfectly okay with the fact that the customers on the receiving end of the software might be the first people to actually run that code.

26. SQLite

So, what I'd like to do now is talk quickly about an example of testing done right.

The example that I'm going to use is an open source database called SQLite, and you can read more about it on its website.

SQLite is a small open source database which is designed to be easily embeddable in other applications. It's very widely used--by companies like Airbus and Dropbox, in various Apple products, inside Android phones, and in many more places.

And by the way, SQLite has good Python bindings if you want to use it yourself.

This database is implemented in about 78,000 lines of code, so you can see it's a small-to-medium-size project as software projects go. On the other hand, the test suite for SQLite comes out to more than 91 million lines of code.

So another way to think about that is, it's more than 1000 times more tests than code.

So if we have SQLite's code over here, it is dwarfed by this gigantic mountain of test code. We might well ask: what's in this test code? How is SQLite tested? It turns out the author of SQLite has achieved 100% branch coverage and 100% MC/DC coverage. A large amount of random testing has been done--that'll be the subject of lessons 4, 5, and 6 of this class. Boundary value testing has been used. The code contains a ton of assertions. SQLite is run under the Valgrind tool I mentioned earlier, which is good for finding bugs in programs written in C. SQLite is subjected to integer overflow checks, and a large amount of fault injection testing is performed.

So what you can see here is that almost everything I've told you about in this course so far--almost every single testing technique and many of the coverage metrics--has been applied to SQLite, and the result is, generally, a really solid piece of software.

One reaction you might have when you see a thousand times more tests than code is that it's pretty extreme. On the other hand, when you consider the large number and wide variety of users that SQLite has, it's pretty great that it has been tested so thoroughly.

Of course, what I'm not doing here is advocating that you go out and write a thousand times more lines of tests than you have code. On the other hand, it's nice to see that certain people who have extremely widely used codes actually do this kind of thing.

27. Automated whitebox testing

So the next topic that we're going to cover is called automated whitebox testing. This isn't a form of code coverage, but rather a way to get software tools to automatically generate tests for your code. So you wrote some code, and the question we're going to ask is: how do we generate good test cases for it?

And of course one answer is that we can use the kinds of techniques I've been talking about for this entire class: we can think about the code, make up inputs, and basically just work hard to get good test coverage. But another answer is that we can run one of these automated whitebox testing tools, so let's see how that works.

def foo(a, b, c):
    if isPrime(a):
        b += 3
        a -= 10
    if isPrime(a):
        if b % 20 == 0:
            return 7
    return 0

The tool's goal is to generate good path coverage of the code that you wrote. It starts off by basically just making up random values for your code's inputs--let's say 1, 1, and 1. So now let's just go ahead and execute the code.

foo(1, 1, 1)

The first question is: is a a prime number? It's not, so we skip the first if. At the second test, a still isn't prime, so we're going to return zero.

So now the automated testing tool has seen a path through the code that didn't take either of the if branches, and it will try to construct a new set of inputs for the function that take a different path.

The most obvious choice to start with is the first if. So the question the tool is going to ask is, "How can I generate an input that's prime?" To do that, of course, it's going to have to look at the code that tests for primality, and it's going to end up with a set of constraints on the value of a which are passed to a constraint-solving tool. If the solver succeeds, the answer is going to be a new value of a that passes the primality test--let's say a is 3.

The automated whitebox testing tool comes up with a new set of inputs to this function, and it's going to go ahead and run it again.

foo (3, 1, 1)

So this time the first test succeeds: a is prime, so we increment b by 3 and decrement a by 10. Now the new value of a, -7, is going to fail the primality test (let's assume our primality check accepts only positive numbers), and we're again going to return zero.

So the question we have is: what has the tool learned? It has now seen one execution that falls straight through and another execution that takes the first if branch, so what it's going to do next is try to build on that knowledge to generate inputs that also take the second branch.

So it's going to take the first set of constraints--the constraints that cause a to be prime--and add another set of constraints that force the updated value of a, that is, a value 10 less than the original value of a, to also be prime. It turns all of that into a set of constraints, passes it to the solver, and the solver either succeeds in coming up with a new value of a or possibly fails.

But let's assume it succeeds, and say that the value of a it comes up with this time is 13. We execute the function again: 13 is prime, so we add 3 to b and subtract 10 from a, giving 3; 3 is prime, so now we ask whether b is an even multiple of 20. If it were, we would return 7, but it's not, so we return zero.

The third time through the function, the tool adds a new constraint. Not only are we keeping all our constraints on a, we're adding a new constraint on b: the updated value of b--that is, b plus 3--mod 20 has to come out to be zero. This time the solver, let's say, comes up with a = 13 and b = 17. Now the tool executes the function another time, and this time it returns 7.

And so by iterating this process multiple times--that is, by running the code and then using what it learned about the code to build up sets of constraints that explore different paths--the tool can generate a set of test inputs that, taken together, achieve good coverage of the code under test.
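
Putting the whole walkthrough together, here's a sketch of the inputs such a tool might generate for foo; the isPrime helper is an assumption on my part (the original code doesn't define it):

def isPrime(n):
    # assumed helper: treats numbers below 2 (including negatives) as non-prime
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, n))

def foo(a, b, c):
    if isPrime(a):
        b += 3
        a -= 10
    if isPrime(a):
        if b % 20 == 0:
            return 7
    return 0

assert foo(1, 1, 1) == 0     # initial random input: falls straight through
assert foo(3, 1, 1) == 0     # solver makes a prime; -7 fails the second test
assert foo(13, 1, 1) == 0    # a and a-10 both prime, but 4 % 20 != 0
assert foo(13, 17, 1) == 7   # all constraints satisfied: 13 prime, 3 prime, 20 % 20 == 0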

Unfortunately, I don't know of any automated whitebox testing tools that exist for Python. For the C programmer, there is a tool called Klee that implements these techniques, and I encourage you to try it out--Klee is really an interesting tool.

As you might expect, in real situations a tool like this might fail to come up with a useful system of constraints, or fail to solve them, for really big codes--and in fact that's absolutely the case: these tools blow up and fail on very large codes. But for smaller codes, like the kind of thing I'm showing you here, they actually do a really nice job of automatically generating good test inputs.

As it turns out, these techniques have been used fairly heavily by Microsoft over the last several years to test their products, and they have found a very large number of bugs in real products.

28. How to use coverage

We're going to finish up this lesson with a summary of how to use coverage to help us build better software.

We're going to have to start off by doing a good job testing, and there's no way around that.

The next step is to measure the coverage of the test using some coverage metric that is appropriate for the software that you're testing.

If the coverage is pretty high--let's say 80% or 90%--then what we should do is take the feedback from the coverage tool, use it to improve our test suite, and then measure coverage again. If the coverage results are poor--let's say we've only covered 20% of the statements in our code base--that's a signal that we need to rethink our testing strategy.

Of course, it could be the case that we're perfectly happy with poor coverage. There are plenty of scenarios--for example, web application development--where we don't need good coverage because our users are going to test the code for us, and if something breaks, we'll detect it by looking at error logs and be able to fix it on the fly.

On the other hand, if we have poor coverage results for some sort of avionics software, or automotive software that's going to be deployed somewhere it's extremely hard to update, we probably need to rethink our plan in a really serious way and try to come up with a much better test suite in order to get a higher level of coverage.

If coverage is used in the fashion that I've outlined here, it can give us a pretty significant amount of bang for the buck. Regardless of the result--whether it's giving us a little bit of feedback to get the last 5% or 10% of improvement in coverage, or whether it's telling us that our testing strategy really isn't very good--it's giving us useful information, information we probably need if we're going to create high-quality software.

And finally, I just want to finish up with a reminder: we strongly believe that if we have a good test suite and we measure its coverage, the coverage will be good. We do not believe, on the other hand, that if we have a test suite that gets good coverage, it must be a good test suite. The reason that is not true is that it's really easy to take a test suite and tweak it until it gets good coverage without actually improving its quality very much.

To finish up: used in the right way, coverage can be a relatively low-cost way to improve the testing that we do for a piece of software. Used incorrectly, it can waste our time and, perhaps worse, lead to a false sense of security.
