CS258 Lesson 5. Testing in Practice


1. Introduction

Now that we have finished with random testing, we'll talk about more advanced issues, such as what to do if a fuzz tester overwhelms you with bugs. It sounds kind of silly, but it can really happen in practice. We are also going to talk about how to take a large test case that makes software fail and turn it into a very small one. Finally, we'll talk about how to report bugs in such a way that developers are more likely to pay attention to them.

2. Overwhelmed By Success

We can be overwhelmed by the success of our bug-finding effort. If you take a large software project that’s never been subjected to random testing, and hit it with a sophisticated random tester, this may very well happen.

This situation may look something like this: we run our tester over the weekend and come back to find 250 failed test inputs. Not all of these failed inputs are going to correspond to a unique bug. Of the 250 failed test inputs, maybe 10 or 50 will correspond to distinct bugs.

There are two main ways to deal with bug inundation:

  1. Report A Bug
  2. Bug Triage

In the first solution, we simply pick a bug and report it. Then as soon as we get a new version of the system, we run the random test again, see which previously troubling test inputs are no longer problematic, and then report another bug. This turns out to be an effective strategy for smallish systems, where bugs can be fixed quickly.

On the other hand, the people fixing bugs may have a slow fix cycle; let’s say we get a new version every couple of years. In that case, we employ bug triage.

Bug triage is the process by which the severity of different bugs is determined and we start to disambiguate between different bugs. This helps us get a handle on which bugs we can report in parallel. Any inputs that trigger separate bugs can be reported in parallel, but if we report every bug-triggering input that we found, we’re going to cause a lot of duplicate bug reports.

How do we start getting a handle on which bug-triggering inputs map to different bugs and which ones map to the same bug? There’s no silver bullet, but we do have a number of different tools to disambiguate bugs.

  1. In the simplest case, the bugs in the system cause assertion violation messages, and we can disambiguate based on those messages. We look at the assertion messages and assume that distinct assertion violation messages are caused by distinct bugs in the software under test. This does not have to be true. We can have one defect that maps to multiple outputs that look different (although this is unlikely). Another scenario is that multiple defects cause the same symptom. What we hope happens is that a single defect maps to a single symptom.

  2. Unfortunately, not all bugs resolve to nice assertion violation messages, and bug disambiguation can be trickier when all we have is a core dump (a dump of the contents of main memory) or a stack trace. These give us some indication of what part of the code failed and of the stack frames leading up to that failure.

  3. Our third weapon when doing bug triage is to search over the revision history of the S.U.T., if we have access to its version control system. If a certain group of bugs appeared just one revision ago while the other bugs are old, then it is likely that this group of bugs is being triggered by code that was recently committed.

  4. Our final weapon is to examine the test case. Often it’s the case that test cases that trigger the same bugs have similar features. The problem is looking over large, randomly generated test cases is really painful. This leads us to test-case reduction or test-case minimization.
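The first two triage heuristics above (assertion messages, then stack traces) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `failure_signature` and `triage` are hypothetical, the failing runs are assumed to be available as captured text output, and stack frames are assumed to be printed on lines starting with `#`, as gdb does.

```python
import re
from collections import defaultdict

HEX_ADDRESS = re.compile(r"0x[0-9a-fA-F]+")

def failure_signature(output, frames=3):
    """Reduce the text output of one failing run to a coarse signature."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # 1. Prefer an assertion message if the run produced one.
    for ln in lines:
        if "assert" in ln.lower():
            return HEX_ADDRESS.sub("ADDR", ln)
    # 2. Otherwise fall back to the first few stack frames, with addresses
    #    masked so that runs differing only in addresses share a bucket.
    stack = [HEX_ADDRESS.sub("ADDR", ln) for ln in lines if ln.startswith("#")]
    if stack:
        return " | ".join(stack[:frames])
    return "unknown"

def triage(failures):
    """Group {test_name: output} into {signature: [test_names]}."""
    buckets = defaultdict(list)
    for name, output in sorted(failures.items()):
        buckets[failure_signature(output)].append(name)
    return dict(buckets)
```

Each bucket is a candidate for one bug report. As the list above warns, identical signatures do not guarantee identical defects, so the buckets are a starting point for triage rather than an answer.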

5. Test Case Reduction

Test-case reduction is the process of taking some large input that triggers a failure and turning it into a small input that triggers the same failure. It's usually the case that many bugs can be triggered by small inputs. On the other hand, the inputs on which we actually discover failures are often large. For example, if we discover that Firefox crashes on a web page, that page might be a giant one that we found in the wild. Generally speaking, though, the pattern in the page that triggers the Firefox crash could probably also be found in some small web page.

One thing we can do is figure out by hand what part of the input causes the failure. We eliminate part of the input. Sometimes we do this in a smart way, for example because we know that some part of it is unlikely to trigger the crash, or we might just chop some of it out blindly. Then we see whether the smaller input still triggers the failure. If it doesn't, we go back to the previous test case and try again with a different piece. If it does, we keep the smaller input and continue. If we're lucky, at the very end of this process we're left with a really small test case, and that's the thing we'd like to report to the people developing the software under test. Of course, even if we're not reporting bugs to someone else, that is to say we're finding bugs in software that we wrote, it's still really nice to have a minimized test case, because it makes it much easier to track down the failure. This first option, the manual one, has been practiced by people debugging for probably about as long as computer science has been around.

The second option is a really nice technique called delta debugging, which automates this process. If you can write a program that can tell automatically whether a particular input triggers a failure, for example by loading the web page in Firefox and seeing whether it crashes by looking at the exit code it returns to the operating system, then delta debugging is a framework that takes your script and the test input and automates this process in a loop. The delta debugger has a bunch of heuristics built in for eliminating parts of the input, and the loop terminates when it can't reduce the input any more. I don't want to go into this technique in a ton of detail, because the person who developed delta debugging is going to be teaching a Udacity class on debugging sometime soon. It's probably going to be really interesting, and I hope I've intrigued you enough that you'll seriously consider taking his class. Right now I'll show you an example.
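To make the reduction loop concrete, here is a minimal Python sketch of a greedy, line-based reducer. It is a simplification written for illustration, not the actual delta debugging algorithm or any real tool's implementation: the name `reduce_lines` is hypothetical, and `still_fails` stands for whatever automatic check tells us that an input still triggers the failure.

```python
def reduce_lines(lines, still_fails):
    """Greedily shrink `lines` while `still_fails(lines)` stays true."""
    assert still_fails(lines), "the original input must trigger the failure"
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        removed_any = False
        while i < len(lines):
            # Try the input with one chunk of lines chopped out.
            candidate = lines[:i] + lines[i + chunk:]
            if still_fails(candidate):
                lines = candidate      # keep the smaller input, retry here
                removed_any = True
            else:
                i += chunk             # this chunk is needed, move past it
        if not removed_any:
            chunk //= 2                # nothing removable at this size
    return lines
```

The loop stops once no single line can be removed without losing the failure. The result is small but not guaranteed to be the smallest possible failing input, and every call to `still_fails` may be expensive, since it typically runs the software under test.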

6. Delta Debugging

Okay, here we're back on one of my work machines, and we're going to look at a test case that makes the latest version of GCC crash. That's not to say the latest released version, but a version built fairly recently from their Subversion repository. If we look at the size of the test case, it's about forty kilobytes, which is not that big as test cases go. Sometimes the inputs that crash your compiler are much bigger, like hundreds and hundreds of kilobytes. What we're going to do now is run the delta debugger.

First, let's make sure we know it crashes, using the GCC that I built from the repository. All right, it has crashed with an internal compiler error, so an assertion was violated inside the compiler. I'm happy that it did, because otherwise this might have turned into a bug that causes the compiler to generate wrong code, and we never want that to happen.

The next thing we need is an automatic script that detects whether a particular program triggers this bug. I've already made the script, called test1.sh. It compiles the program at optimization level -O3 and then searches the output for the particular string from the crash message. The reasoning is that we want to be sure the reduced program doesn't just crash GCC, but crashes it with this particular bug. The script returns zero to the operating system if it finds the string and something else otherwise. Now we invoke the delta debugger, telling it which script to use, and it goes ahead and tries to minimize this test program for us.

Okay, the delta debugger just finished. I cut out the several minutes of us waiting for it, but it took about four minutes. Let's see how big the program is. It's not that small, but it's about seven kilobytes, which is a real improvement. The delta debugger got rid of a bunch of the junk. There's still a little bit left, but we'd be able to cut out some more kilobytes by hand if we wanted to, fairly rapidly, and much faster than we would have been able to deal with the original forty-kilobyte input.

Something you might be saying to yourself is that it's all well and good to do test-case reduction on compiler input with delta debugging. The really nice thing about delta debugging is that it works for any text-based input format. If we're trying to reduce an HTML document that causes a web browser to crash, we can use delta debugging. If we're trying to reduce a JavaScript program that causes the JavaScript interpreter to crash, we can use delta debugging. We can use it for anything that's text.

Think back to the bounded queue data structure that we developed a random tester for. Those random tests didn't have any external manifestation; they only existed as data structures internal to Python, so we can't run the delta debugger on them because there is no text file. What we could have done is save each test to a text file before running it on the queue, that is, translate the operations we're performing into strings. Once that is saved on disk, we can run the delta debugger on it and come up with a minimal set of queue API operations that causes the queue to do something wrong. That would probably be a really good idea.

So delta debugging is an extremely powerful technique. Test-case reduction is something that we really need to have in practice, especially since it ties in with the triage idea that I talked about a little bit earlier. Remember that one of our triage methods is looking at the test case. Well, looking at big test cases is really hard, because there's a bunch of junk sitting in the way, obscuring the actual features that are causing the system to crash or otherwise misbehave. If we can do test-case reduction before doing the bug triage, then the whole process becomes a lot easier.
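The test script from the demo can be sketched in Python as follows. This is a hedged illustration, not the script used in the lecture: the names `is_interesting` and `main` are hypothetical, and the compiler command and crash string are supplied by the caller rather than hard-coded.

```python
import subprocess
import sys

def is_interesting(command, needle, timeout=60):
    """Return True if running `command` prints `needle` (the crash message)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # treat a hang as "not this bug"
    # Compilers usually report internal errors on stderr, so check both streams.
    return needle in proc.stdout + proc.stderr

def main(argv):
    """argv = [needle, program, arg, ...]; returns 0 if the bug still triggers."""
    needle, command = argv[0], argv[1:]
    return 0 if is_interesting(command, needle) else 1
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))` with the crash string followed by the compiler command line. Searching for the specific message, rather than accepting any nonzero exit code, is what keeps the reducer from drifting to a different bug while it shrinks the input.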

7. Reporting Bugs

I'd like to talk a little bit about the art of reporting bugs. Let's say you've been successful in creating some sort of really nice random tester that starts crashing utilities, and you decide you want to do your own fuzzing study. If you recall, the original fuzzing paper was written in 1990, there was another in 1995, another in 2000, and another in 2006. So in 2012 it's a perfect time for a new one. You take whatever operating system you choose and start crashing software. Hopefully, instead of just being happy about what a successful effort you've had, you report the bugs to the maintainers so they can start improving the robustness of the software. Let's talk about how to do that.

The first rule, and it's really simple, is to avoid reporting duplicate bugs. The way to do this is to go to the bug reporting system that most open-source projects maintain, search on some symptom of the bug that you found, and see if it has been reported. Let's say we found a way to get a segmentation fault, and we'll see if anybody else has reported one. Okay, somebody has reported one here. So what do we do if we want to report a new segmentation fault? We look at the existing report and see whether the bug we're about to report seems to be sufficiently different. If it is, we go ahead and report it.

The next golden rule for reporting bugs is to respect the local conventions of the community you're reporting to. The best thing to do here is to look at some bugs that have been reported, see what kind of discussions they generate, and try to take a similar tone.

The third rule is to report a small, stand-alone test case. The test-case reduction that we talked about a little bit earlier is really important for purposes of bug reporting, and that kind of process should always be part of your bug reporting. The second clause, a stand-alone test case, is also really important. This means, for example, that you shouldn't report a bug that depends on some other file that only exists on your machine. That isn't very useful to the people who are going to try to reproduce the bug, because of course they don't have the other file. Do what you can to make the bug report stand alone.

The next rule is to report only valid test cases. This refers back to the input validity problem we've talked about a number of times in the material on random testing. If the software to which you're reporting bugs isn't supposed to programmatically reject invalid inputs, then you shouldn't report an invalid input. One example of such a system is a C compiler, which isn't supposed to reject all invalid inputs. Another example might be the internal communication between a JavaScript engine and a web browser; there may be no call for validity checking on an internal interface like that. Basically, we want to be sensitive to the fact that some systems can't do, or can't afford to do, full validity checks of their input, and for those systems we often need to make sure the input is valid ourselves, by hand. Of course, other systems, such as the PDF reader we talked about, are supposed to reject invalid inputs and are not supposed to crash with some sort of buffer overflow even on invalid inputs. For those kinds of programs this issue doesn't really exist.

The next thing you need to do is tell the recipients of your bug report what the expected output of the system was, that is, what you hoped would happen if the system worked, and what the actual output was, that is, what really happened on your system. The temptation is not to supply these things, because you think it's insulting and everything is really obvious, but in general it's a good idea to err on the side of caution and include them. So you might include, in a bug report to the Python maintainers, that you added one plus one and were expecting two, but that this particular version of the implementation gave you something else.

The next thing you need is to make the failure reproducible, because nothing will get a bug report ignored faster than not being reproducible. You need to include things like platform details. For example, it's probably crucial information whether you're running Firefox on Windows or on Linux if you're trying to report a Firefox crash. The exact version of the software under test is important. And of course, if it's an open-source project that you hacked yourself before compiling, you definitely need to include details about that in the bug report, because if the bug is your fault, you don't want to waste people's time trying to track it down.

In summary, if you're able to come up with good, solid answers to all of these requirements, then it's a great idea to report a bug. So let's go and do that right now.
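One way to avoid forgetting any of these items is to assemble the report from a fixed set of fields. The sketch below is only an illustration: the function `format_bug_report` and its field names are hypothetical, and the layout should be adapted to the conventions of whichever bug tracker you are reporting to.

```python
import platform

def format_bug_report(title, sut_version, command, expected, actual,
                      test_case, local_changes="none"):
    """Assemble a plain-text bug report from the pieces a maintainer needs."""
    sections = [
        ("Summary", title),
        ("Version of software under test", sut_version),
        ("Platform", platform.platform()),       # OS and architecture details
        ("Local modifications", local_changes),  # any patches applied before building
        ("Command to reproduce", command),
        ("Expected output", expected),
        ("Actual output", actual),
        ("Stand-alone test case", test_case),    # reduced, with no external files
    ]
    return "\n\n".join(f"{heading}:\n{body}" for heading, body in sections)
```

Because every argument except `local_changes` is required, a missing expected output or version number shows up as an error when the report is built, not as a follow-up question from a maintainer.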

8. Example Bug Report

What I want to do now is go back to the terminal. Instead of the GCC bug we were tracking down, let's look at an LLVM one. When we compile small.c, the compiler dies with an assertion violation about something being live into the entry block. The first thing we need to do is check whether the LLVM people already know about this one. So we go to their Bugzilla and search for this exact string. Nothing comes up, so we know about a bug that they don't know about, and we can report it.

Going back to the test case: instead of invoking the delta debugger, which is a nice, extremely general-purpose, powerful tool, I used a different tool called C-Reduce, developed by my group. It is extremely special-purpose. C-Reduce operates on exactly the same ideas as the delta debugger; it just has extra knowledge embedded in it about how to reduce C programs. It takes a little while, but you don't have to wait for it like I did. It took about eleven minutes, so it's not incredibly quick, but not too shabby either. Remember that this wasn't time that I had to spend attending to the computer; the computer was just doing an automated search. And the resulting test case is pretty small, which is nice.

Now I'm going to make a bug report. First I paste in the version of Clang that I'm using, then the test case, and then I show Clang crashing on it. I haven't gone through all of the steps that I narrated earlier because I report a lot of compiler bugs and I know what I can get away with. This includes, I believe, enough information for the LLVM people to reproduce the bug. We just need a name for this bug report. Bugzilla tries to help us avoid duplicates, but I don't think it has told us about anything that we didn't know, and the one it shows is marked as fixed anyway. So we're good, and we submit it. Now the LLVM developers will have a new bug report when they get up in the morning. That concludes our demo of bug reporting.

9. Building A Test Suite

I'd like to talk just briefly about building a test suite for a piece of software. A test suite is just a collection of tests. It's often the case that the test suite can be run automatically. It's also often the case that the test suite is run periodically, for example nightly or on every commit, although the latter may not be feasible if commits are frequent and the test cases are slow. The goal of a test suite is to show that the software under test has some desired properties, namely passing all the tests, although it's very common for real software to almost always be in a state of partial failure. The hope is that most of the time most of these failures aren't critical or severe ones.

So the question is, what goes into a test suite? To a large extent this is a matter of taste and preference, but there are some pretty common features of nearly all test suites.

First of all, it's very common to have a lot of feature-specific tests. These are small tests that exercise very specialized behaviors. For example, if we're developing a web browser, we might have tests that check that different HTML elements render correctly, and that sort of thing.

It's also very common for a test suite to contain large, realistic inputs. For example, for testing a microprocessor, such an input might be a workload that runs for a couple of hours. The purpose of these kinds of inputs is to apply realistic stresses to the system and to exercise a lot of features in combination.

Third, it's nearly always a good idea to include regression tests in a test suite. A regression test is basically any input that has caused any version of the software under test to fail at any time. Regression tests exist for several reasons, the main one being that we want to make sure the software under test doesn't regress, that is, doesn't go back into a state in which it fails on a bug that we already fixed. There are a number of reasons why that can happen:

  1. We might not have gotten rid of all instances of the defect that caused the bug in the first place. For example, a buggy piece of code might have been cut and pasted to several other places. Those other locations might not be causing our system to fail currently, but some other change might enable the bug to fire again.

  2. It's pretty easy, for example through misuse of the revision control system, to accidentally go back to a version of a file from before we fixed a bug. If that happens, we want to catch it as soon as possible, and regression tests let us do that.

  3. Defects in software often correspond to errors in people's thinking. It's pretty often the case that the person who introduced a defect didn't actually correct the error in their thinking. Maybe somebody else fixed the defect, and the person retains a mistaken assumption about some API or something. Due to the lingering error in somebody's head, they can go ahead and add similar defects to the system later on. If we have good regression tests, we stand more of a chance of catching those kinds of things.

Something that usually doesn't go into a test suite is a random tester. For whatever reason, and I'm not sure that I understand all the reasons, random testing is often treated as a separate activity. This is related to the fact that random tests are often nondeterministic unless we're careful to use the same seed, and they don't have a clear correctness criterion. Perhaps more importantly, random tests always have the possibility of showing us something new, that is, of introducing a test case that we haven't seen before. In fact, that's what we hope will happen. But it can be undesirable, because we want test suite results to be predictable; a test suite is supposed to consist of things that we know to test for. If all of a sudden the test suite gives us a new and different test, that's not necessarily good. So for whatever combination of these reasons, random testing is often a separate activity.
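A regression suite of the kind described above can be as simple as a table of saved inputs replayed by the standard `unittest` module. The sketch below is illustrative only: `parse_csv_ints` is a hypothetical software under test, and the three saved cases are imagined past failures, not real bug reports.

```python
import unittest

def parse_csv_ints(text):
    """Hypothetical software under test: parse '1,2,3' into [1, 2, 3]."""
    return [int(field) for field in text.split(",") if field.strip()]

# Each entry is an input that made some earlier (imaginary) version fail,
# paired with the output we expect now that the bug is fixed.
REGRESSION_CASES = [
    ("", []),               # empty input
    ("1,2,", [1, 2]),       # trailing comma
    (" 7 , 8 ", [7, 8]),    # whitespace around fields
]

class RegressionTests(unittest.TestCase):
    def test_saved_failures_stay_fixed(self):
        for text, expected in REGRESSION_CASES:
            # subTest reports each saved input separately if it fails again.
            with self.subTest(text=text):
                self.assertEqual(parse_csv_ints(text), expected)
```

When a random tester finds a new failure, the reduced input can be appended to `REGRESSION_CASES`. That keeps the suite itself deterministic while still benefiting from whatever the random tester discovers.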

10. Hard Testing Problems

Sometimes we find ourselves faced with really hard software testing problems. Let's go over some of the characteristics of these problems.

Lack of a specification is common, or perhaps only lack of a good specification.

There may be no comparable implementations. Testing the first system of some sort is quite hard, because it probably means we're developing the specification even as we go.

Big systems are hard to test.

Large, highly structured input spaces are quite hard. To imagine a hard testing problem with a large, highly structured input space, consider the flight control computers on a spacecraft or an airplane. These take an enormous variety of input from all sorts of different redundant sensors. The time at which these inputs arrive is significant. The spacecraft or airplane has all sorts of physical properties, like its altitude, its attitude, and the position of the various control surfaces, which all affect the dynamics. These kinds of systems are truly hard to test.

Nondeterminism can make a system very hard to test. The issue here is that we play a test case against the system once and it succeeds, but at some later time some variable not under our control causes the system to fail on that same input.

Lots of hidden state makes systems hard to test. An example is the Java virtual machines run by financial organizations on lots of cores with huge amounts of memory. The thing has so much internal state that when something goes wrong, it's almost impossible to make any inferences about what was going on inside it. You need to try to reproduce the problem, but of course that's also extremely hard, because the problem probably happened three hours into some sort of massive computation.

Finally, if we lack strong oracles, testing can be really hard. For example, a large molecular simulation might be very hard to test. If it's a new simulation code, we have no idea what the right answer is supposed to be. It's probably running on a large parallel machine, and we have a very hard time reproducing problems that take such a long time to occur. An autopilot would be extremely hard to test, because its response is inherently expressed in terms of the stability and good behavior of the airplane, and of course the airplane is an incredibly large, complex physical object that is hard to model and simulate in a reliable fashion. Making a strong test oracle for an autopilot seems almost inconceivable. Likewise, for the giant JVM running on lots of cores for a long time and using a huge amount of heap, it is very hard to test the behavior of something in that kind of state.

The question we need to ask ourselves is how to handle these situations. How should we test these things? Often there aren't really any easy answers. What we can do is leverage weak oracles to the maximum extent possible. If any of these systems crashes in an obvious fashion, then we definitely know something has gone wrong. We can try to bootstrap some degree of confidence in the software under test by using small test inputs for which we can check the output, and then trying to argue that if the system responds well to these inputs, it also responds well to other inputs. In the end, if we're thwarted in our attempts to do really good testing, we might have to rely on non-testing methods. Of course, we should be doing code inspections and using formal methods for these systems in any case if we care about their reliability. But if we really can't test the system effectively, we might have to rely on these things more than we would like. That's just a quick survey of things that can make testing really hard in practice.
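As a small illustration of leaning on weak oracles when no strong oracle exists, the Python sketch below checks a toy simulation without knowing its exact answer. Everything here is an assumption made for the example: `step` is a hypothetical stand-in for a simulation code (one semi-implicit Euler step of a unit harmonic oscillator), and the oracle only checks that values stay finite and that the energy stays within a chosen tolerance of its starting value.

```python
import math

def step(x, v, dt):
    """Hypothetical SUT: one semi-implicit Euler step of a unit oscillator."""
    v = v - x * dt
    x = x + v * dt
    return x, v

def weak_oracle(x, v, initial_energy, tolerance=0.05):
    """Cheap checks that any correct run should pass, whatever the exact answer."""
    if not (math.isfinite(x) and math.isfinite(v)):
        return False
    energy = 0.5 * (x * x + v * v)
    return abs(energy - initial_energy) <= tolerance * initial_energy

def run_checked(step_fn, x, v, dt, steps):
    """Run the simulation; return the first step that violates the oracle, or None."""
    initial_energy = 0.5 * (x * x + v * v)
    for n in range(steps):
        x, v = step_fn(x, v, dt)
        if not weak_oracle(x, v, initial_energy):
            return n
    return None
```

A weak oracle like this cannot confirm that the trajectory is right, but it catches whole classes of mistakes (sign errors, blow-ups, NaNs) cheaply, and it can be run on every step of every random test.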

11. Summary Of Testing Principles

  • Testers must want software to fail
  • Testers are detectives: they must be observant for suspicious behavior and anomalies in the S.U.T
  • All available test oracles should be used in testing
  • Test cases should contain values selected from the entire input domain
  • Interfaces that cross a trust boundary need to be tested with all representable values, not just those from the ostensible (obvious) input domain
  • A little brute force goes a long way
    • Sometimes, selected interfaces can be exhaustively tested
    • Almost everything else can be randomly tested
  • Quality cannot be tested into bad software (Therac-25)
  • Testable software has:
    • no hidden coupling, side channels
    • few variables exposed to concurrent access
    • few globals shared between modules
    • no pointer soup
  • Code should be self-checking whenever possible, using lots of assertions; however:
    • these assertions are not used for error checking
    • assertions must never be side-effecting
    • assertions should never be trivial or silly
  • When appropriate, all three kinds of input should be used as a basis for testing
    • APIs that are provided by the S.U.T. can be tested directly
    • APIs used by the S.U.T. can be tested using fault injection
    • non-functional inputs (e.g. thread scheduling in multi-threaded code) should be tested by whatever method can be made to work
  • Failed coverage items do not provide a mandate to cover the failed items, but rather give clues to ways in which the tests are inadequate
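The fault-injection principle in the list above (testing the APIs used by the S.U.T.) can be sketched in Python by substituting a wrapper for the function the software calls. The names `FlakyReader` and `load_settings` are hypothetical, and the retry policy of the example S.U.T. is an assumption made purely for illustration.

```python
class FlakyReader:
    """Wraps a read function and raises OSError on the chosen call numbers."""
    def __init__(self, real_read, fail_on):
        self.real_read = real_read
        self.fail_on = set(fail_on)   # 1-based call numbers that should fail
        self.calls = 0

    def __call__(self, path):
        self.calls += 1
        if self.calls in self.fail_on:
            raise OSError("injected fault")
        return self.real_read(path)

def load_settings(path, read, retries=2):
    """Hypothetical SUT: parse 'key=value' lines, retrying failed reads."""
    for attempt in range(retries + 1):
        try:
            text = read(path)
            break
        except OSError:
            if attempt == retries:
                return {}             # give up gracefully instead of crashing
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
```

Passing the reader in as a parameter is what makes the injection possible without touching the file system. The same wrapper can be driven by a random tester that chooses `fail_on` at random.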

All right, we've come nearly to the end of our course. What I'd like to do now is summarize what I think are the high points, that is, the most important testing principles that I've tried to convey in this course, and put them all in one place. So let's go through them.

First of all, testers must want software to fail.

Second, testers are like detectives who are hunting down bugs. As detectives, testers have to be observant of all sorts of suspicious behaviors and anomalies in the software under test. My guess is that a number of really serious bugs were things that had already been noticed by people but had been swept under the rug, because the people were busy, they just wanted to ship the product, or maybe they were users who didn't know what the bugs meant. Users have the luxury of ignoring bugs; testers don't. So it's really important not to sweep things under the rug.

All available test oracles should be used as a basis for testing. You might be tempted to think, from the language I was using about strong oracles versus weak ones, that if we had a couple of good strong oracles available, we wouldn't need to use them all. I hope I've convinced you by now that this is not the case. All of the oracles should be used, because they generally detect different kinds of faults, and even if they detect the same faults, weak oracles might be much cheaper to use.

Test cases should contain values selected from the entire input domain. If there's doubt about what exactly the domain of something is, it's good to talk it over with the developers.

Interfaces that cross a trust boundary need to be tested with all representable values, not just those from the ostensible input domain. Recall the examples we looked at. If we are writing a web server, we might hope that everybody submits well-formatted data, but most likely they won't, and the reason they won't is that they'll be trying to break into our web server. So we need to test on that kind of data to ensure that we correctly reject it. Similarly, a piece of software like the Linux kernel has a trust boundary at the system call interface, that is, at the interface between user-mode applications and the kernel. The Linux kernel, like the web server, can't trust its clients to make well-formed requests all the time. It should expect that those clients, if not actually hostile, are at least buggy and will send all sorts of crazy stuff, and it has to handle that without crashing or violating security policy.

A little brute force goes a long way in testing. In particular, in certain restricted circumstances we can do exhaustive testing, and almost anything else can be randomly tested.

Quality cannot be tested into bad software. We saw the Therac-25 example, where the control software for the radiation therapy machine was probably so broken that almost no amount of testing would have been sufficient to make it good. It needed to be thrown away, and they needed to start over. I'm sure we've all seen software that looks like that.

In contrast with examples like the Therac-25, testable software has the following properties: no hidden coupling between modules; no side channels through which modules can share information without being visible to the system developers; few variables that are shared between threads; few global variables shared between modules; and no pointer soup, that is, no huge data structures with pointers going everywhere, where we can't possibly keep track of who's changing what and what's valid and what's not.

Code should be self-checking whenever possible, using plenty of assertions. These assertions are never used for error checking; rather, they're used to check for logically impossible conditions that imply some sort of internal consistency violation. Assertions must never be side-effecting, because if they are, the system's behavior will change when you turn them off, and this leads to madness among developers. Finally, assertions should never be silly or trivial, because such assertions serve no purpose, clutter the code, make things slower, and fail to convey useful information to the next person who looks at the code.

When appropriate, all three sources of input to a piece of software under test should be used as a basis for testing. These include the obvious APIs provided by the software under test, which can be tested directly. The APIs used by the software under test can be tested using fault injection techniques. Recall that these are things like substituting the library that provides these APIs with a different library that injects faults, or perhaps just hacking the layer underneath to inject faults. Finally, non-functional inputs, such as the timing of context switches between threads, should be tested using whatever method you can get to work.

And finally, the last principle: failed coverage items do not provide a mandate to cover the failed items, no matter how tempting that might be. Rather, they give clues to ways in which the test suite is inadequate. Simply covering the items to comply with the coverage metric is going to destroy those clues, and it's going to do so in a way that doesn't improve the quality of the test suite very much.

Taken together, this list of items comprises pretty much all that I know about testing, and the detailed version of it has been the content of this course. This is material that I have never taught before, so I hope it came out in a fairly coherent fashion. It's stuff that's been brewing in my mind for a long time. I wanted to teach it because I've felt for years that we don't seem to be doing a very good job of teaching CS students to test. What happens instead is that they write small test cases in response to assignments, they run the test cases we give them, and then they never look at them again. It's hard to think of anything less like the real world of software development than the environment we create. What I've tried to do is structure this course a little bit differently, focusing on issues that are really important but that we often don't do a very good job with. It's been really enjoyable for me, and it's been great to actually try to set this material down in a coherent fashion. I very much hope that this material has been useful for you and that the class has been enjoyable. Thank you.