CS259 Lesson6

01 l Bug Management

Welcome back to the debugging course. In the past units we have discussed how to reproduce bugs, how to track their origins, how to simplify them, and how to find the defect that causes a failure. Today we will consider the management side of bugs. That is, how to track bugs, how to organize the debugging process, how to make sure that bugs don't reappear, and how to find out where and why bugs occur in a project. And again, this comes with a dose of automation. So we look at and build tools that help us automate these things. But first a little story from the trenches.

02 l GNU DDD

In 1992 I did my master's thesis on a tool that would take a program and visualize it in various graphical forms. It would produce correctly formatted text, flow charts, or Nassi–Shneiderman diagrams. And you'd even be able to edit these in an editor. Later I went on to pursue my Ph.D., which I completed in 1997 on a topic named configuration management with feature logic. The idea was to use description logic to model changes and variants and to detect inconsistencies. This was pretty cool, but the problem was that apparently no developer was willing to learn description logics. Plus, configuration management was essentially solved, so I was disappointed. That's when a student of mine, Dorothea Lütkehaus, came along, and we developed the idea to use my old library for visualizing programs to visualize data structures rather than programs. Finally, I would be doing something useful. The resulting tool became a debugger named DDD, for data display debugger. This is what DDD looked like. It had a command line interface so you could enter arbitrary commands. You'd also see the source code. You would be able to set breakpoints and see the current execution position. The interesting thing about DDD, however, as the name data display debugger says, was the ability to show data structures. For instance, up here we have visualized the pointer list, and now I can double click on the pointer to see where it points to. This is the element the pointer points to. I can check out what self points to. Obviously, it points to itself. I can look up the next value. Again, look up self, and again look up the next value and see that the whole thing actually becomes a linked list. Needless to say, as I keep on stepping through the program, and as the values change, the display would be updated automatically as well. DDD eventually became free software, and was even adopted by the GNU project. So it became GNU DDD, and it became very popular with C and C++ programmers.
Today there is even Python support built into DDD, but I'm not maintaining this anymore, so I don't know how well this works.

03 l Bug Reports

But being the maintainer of a popular piece of software also means you have to provide lots of support, which meant that I got plenty of bug reports. These ranged from the absurd, as in "DDD crashed, thought you'd like to know", to the rather questionable: "DDD crashed on our Cray. I have enclosed a core dump. Please be aware we're working on super-secret data here, so don't share. Attachment: core dump, 50 MB." Some of them were very questionable: "DDD hangs on my new pets@home project. All files are enclosed. Attached: home directory." The home directory here actually included passwords and bank account information. So much for free software. On some days I could easily get dozens and dozens of such bug reports. Some with helpful information, many without. Of course, I would have been able to ask for more information on each of these, but then, you know, I wasn't exactly paid for the job. Finally, I set up DDD such that whenever it crashed, it would ask for the vital information that I needed to reproduce the failure.

04 l Vital Information

What is this vital information that needs to go in the bug report? In a 2008 study involving 165 developers from Apache, Eclipse, and Mozilla, the most important facts the developers needed were facts about the problem. First, the problem history. That is, the steps needed to reproduce the problem. For instance, start Preview and then open the attached file. Second, diagnostic information. That is, core dumps, meaning memory dumps of the final state of the program before it crashed; stack traces, meaning the functions that were active at the moment the failure occurred; or logs. In short, whatever the system has recorded about the final state, and what the program has logged so far. Next, the experienced behavior. This is what the user saw. For instance, Preview crashed. Next, the expected behavior. This is helpful as a reality check. Does the user expect the same as the developer? Mostly, this is just the opposite of the experienced behavior. Finally, a one-line summary. This is typically the basis for searching for a bug report, as well as for deciding the severity of a problem. For instance, Preview crashes when opening PDF file.
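
To make these five facts concrete, here is a minimal sketch of a bug report record. The class name `BugReport`, its field names, and the `missing_facts` helper are assumptions made up for illustration; they do not come from the study or from any real bug tracker. The sample values follow the Preview example above, except for the expected behavior, which is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BugReport:
    """Hypothetical record holding the five facts named in the lesson."""
    summary: str                  # one-line summary, used for searching
    problem_history: List[str]    # steps needed to reproduce the problem
    experienced_behavior: str     # what the user saw
    expected_behavior: str        # what the user expected instead
    diagnostics: List[str] = field(default_factory=list)  # stack traces, logs

    def missing_facts(self) -> List[str]:
        """Return the names of the facts that were left empty."""
        facts = {
            "summary": self.summary,
            "problem_history": self.problem_history,
            "experienced_behavior": self.experienced_behavior,
            "expected_behavior": self.expected_behavior,
            "diagnostics": self.diagnostics,
        }
        return [name for name, value in facts.items() if not value]


report = BugReport(
    summary="Preview crashes when opening PDF file",
    problem_history=["Start Preview", "Open the attached file"],
    experienced_behavior="Preview crashed",
    expected_behavior="Preview shows the PDF",  # invented for this sketch
)
print(report.missing_facts())
```

A check like `missing_facts` is one way a reporting dialog could ask for the facts that are still missing before the report is sent.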

05 q Most Crucial

Now for the quiz. One of these 5 facts allows you to infer most, if not all, of the others, and therefore it is the most crucial of these 5. Which one is it? Is it the problem history? Diagnostic information? The experienced behavior? The expected behavior? Or a one-line summary?

05 s Most Crucial

The one fact that allows you to infer most, if not all, of the other facts is the problem history. First of all, the problem history is crucial to reproduce the bug. If we repeat the steps, then we will hopefully see the same experienced behavior, which we can contrast with the expected behavior. We may also be able to summarize what's going on and possibly even observe the state at the moment the failure occurred, getting more information such as a core dump or the currently active functions. So, of all these facts, the problem history is the most crucial and the most important in a bug report.

06 l Example Bug Report

Many operating systems allow you to submit bug reports more or less automatically as soon as a program crashes. This is the dialog that appears on a Mac OS system. Essentially Apple asks you just one single question, and that is again the steps necessary to reproduce the problem. Everything else can be deduced from that one. On top of that, the bug report also includes problem details and the system configuration, which is, for instance, the version number of the program that crashed, the current process, date and time, the functions that were active at the moment of the crash, as well as any hardware attached to the machine. Here are the steps needed to reproduce the problem, and we send the whole thing to Apple.

07 l Bug Databases

So, when you are managing a successful product, you will get many of these bug reports, and you must make sure that no bug report ever gets lost and also that all these bug reports eventually get handled in finite time. Before that, you'll have to store these problem reports somewhere so that you can classify them, mark them, and evaluate them. This is the task of a problem database. Simply speaking, a problem database holds all problems that ever occurred with some product. Whenever somebody reports a problem, either through automatic means or as a regular user, the problem gets stored in the problem database. You may wonder how many of these bug reports end up with large companies over time. Actually, these problem databases are huge. I don't know how this is for Apple, but I've been the first non-Microsoft researcher ever to peek into the problem database of Microsoft. I can't tell you the exact number of bug reports that are in there, but consider that Microsoft has sold 450 million Windows 7 licenses. If every user experiences just one such crash per year, that's more than 37 million bug reports each month. And this is only from people who actually click on the send button. I guess there are a number of people who run Windows but don't have a proper license and prefer not to tell Microsoft about their problems.

09 q Bug Nr 915

Now for a quiz--When was bug number 915 filed? So, was this in 1992, 1998, 2002, or 2009?

09 s Bug Nr 915

Let's go and find this out--we simply enter the bug number in the search field and get bug number 915. So, when was it reported? This is one of the oldest bugs in Mozilla, and at this time in 2012 it still hasn't been fixed. That is after almost 14 years. Still this bug has quite some history. Hundreds and hundreds of people-- Well, let's make that dozens and dozens of people have made hundreds and hundreds of comments over time. This seems like a tough bug to fix. So, the correct answer here is 1998.

10 l Bug Report Fields

How are all these problems classified? Let's look at the advanced search to see what the individual attributes of a bug report are. Up here you can see the status of the problem. Is it a new problem or is it a resolved problem? If it's a resolved problem, you can look up the resolution here. Was it fixed, or is it something that won't be fixed? Further down you can see the version of Mozilla that is affected by the problem as well as the severity and the priority of the problem. What do these fields mean? Let's look at them in detail. First there's severity, which describes the impact of the problem on the user. The most severe problems are classified as blockers or show stoppers. These are problems that effectively halt all further development. Then there are critical problems, major problems, minor problems, down to enhancement requests. Enhancement requests are issues a user would like to see in some future version. They are also stored in the problem database, but let's say that other problems are more severe at this point. Priority defines how soon the problem will be addressed. The problems with the highest priority will be addressed first. Problems with lower priority will be addressed later. The Bug ID is a unique identifier or number, which allows you to precisely identify a single problem. Comments are left by developers adding information about the bug or proposing fixes to address the bug. Finally, notification. These are the stakeholders. Whoever is listed here will be notified automatically whenever a problem gets a new status.

11 q Fixed First

Now for a quiz. Which problems get fixed first? Is it those with the highest severity? Those with the highest priority? Those which have been around for the longest? Or those which are the easiest to fix? Pick your choice.

11 s Fixed First

As stated before, the priority defines how soon a bug will be fixed, so the correct answer is those with the highest priority. Of course, it may be desirable to fix severe bugs first, but a severe bug may affect only a single user. Hundreds of other users may suffer from minor issues, which therefore may have to be addressed first. The same goes for bugs which have been around for the longest. Again, these may not be important for many users. And those which are the easiest to fix? Well, if a problem is easy to fix, that certainly increases its chance of getting fixed soon. Then again, these problems may not be important.
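
The difference between severity and priority can be sketched in a few lines. This is only an illustration under assumptions: the `Problem` class, the numbering scheme (1 means "address first"), and the sample data are made up, not taken from Bugzilla.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    bug_id: int
    severity: str        # impact on the user, e.g. "critical" or "minor"
    priority: int        # 1 = address first (numbering is an assumption)
    affected_users: int  # hypothetical field explaining the chosen priority


def work_queue(problems):
    """Order problems by priority only; severity is not part of the sort key."""
    return sorted(problems, key=lambda p: p.priority)


problems = [
    Problem(bug_id=1, severity="critical", priority=2, affected_users=1),
    Problem(bug_id=2, severity="minor", priority=1, affected_users=300),
]
# The minor problem that affects many users is handled before the critical one.
print([p.bug_id for p in work_queue(problems)])
```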

12 l SCCB

Many organizations use a software change control board, or SCCB, to set priorities. This is a group of people who look into the problem database and take care of the handling of the problems. Such a group typically consists of developers, testers, and managers. What they do is keep track of resolved and unresolved problems, assign a priority to individual problems, and then assign these problems to individual developers. As you can see, a problem database is a very important tool in the management of a product. All the discussion on all the features is eventually stored in here. You can even use it as a tool for requirements management. When the project starts, you enter a single problem in here, which states that the product is missing. Then you add more and more requirements, which eventually will have to be fulfilled. A board like the SCCB then decides who should take care of which issue and when.

13 l Problem Life Cycle

Suppose you have just filed a new problem report. What happens next? In a problem database like Bugzilla, the problem goes through a number of stages. Initially the problem report is unconfirmed. If all the information in the problem report is valid, then it goes into the new state. A manager or the software change control board assigns the bug to an individual developer, who now works on it. The developer now resolves the problem, and for resolving it the developer can choose between multiple resolutions. The problem can be fixed, meaning that the problem has actually been addressed. The problem can be marked as a duplicate, meaning that the problem already exists somewhere else in the database and therefore possibly somebody else is already working on it. The problem can have a resolution of invalid, meaning that the problem is not a problem or does not contain the relevant facts. A resolution of won't fix means that the problem will never be fixed, which is a somewhat sad outcome for the one who originally submitted it. Then we have works for me as a resolution, meaning that the developer could not reproduce the problem. Note that if the bug report is invalid or a duplicate, this may also be found out at an earlier stage, and the problem immediately gets resolved, of sorts. If the resolution is fixed, then the fix will typically be verified by the quality assurance team, and as soon as the final product ships with the fix in it, the bug report is marked as closed. In case the problem recurs, it goes into a state of reopened and then needs to be reassigned to a developer. This can also happen from the resolved state, for instance if additional information becomes available that makes the original resolution obsolete.
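
The life cycle just described can be written down as a small state machine. The following is a sketch of my reading of the description above, not Bugzilla's actual implementation; the class `ProblemReport`, the method `move_to`, and the exact spelling of the state names are assumptions.

```python
# Allowed transitions, following the life cycle described above.
TRANSITIONS = {
    "UNCONFIRMED": {"NEW", "RESOLVED"},   # invalid/duplicate may resolve early
    "NEW": {"ASSIGNED", "RESOLVED"},
    "ASSIGNED": {"RESOLVED"},
    "RESOLVED": {"VERIFIED", "REOPENED"},
    "VERIFIED": {"CLOSED"},
    "CLOSED": {"REOPENED"},
    "REOPENED": {"ASSIGNED"},
}

RESOLUTIONS = {"FIXED", "DUPLICATE", "INVALID", "WONTFIX", "WORKSFORME"}


class ProblemReport:
    """Hypothetical problem report that only permits legal state changes."""

    def __init__(self):
        self.state = "UNCONFIRMED"
        self.resolution = None

    def move_to(self, new_state, resolution=None):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        if new_state == "RESOLVED":
            if resolution not in RESOLUTIONS:
                raise ValueError("a resolved problem needs a valid resolution")
            self.resolution = resolution
        elif new_state == "REOPENED":
            self.resolution = None   # the old resolution no longer holds
        self.state = new_state


report = ProblemReport()
report.move_to("NEW")        # the report was confirmed as valid
report.move_to("ASSIGNED")   # a developer now works on it
report.move_to("RESOLVED", resolution="FIXED")
print(report.state, report.resolution)
```

Encoding the transitions as a table makes illegal shortcuts, such as closing a report whose fix was never verified, fail loudly instead of silently.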

14 s Which Stage

Now for the answer. The user has reported the bug, and Dora has confirmed it is valid. It got assigned to Erol, and Erol has already fixed the bug. So, we're currently in a resolved state. What has not happened yet is that somebody else has verified the fix, and what also has not happened yet is that the fix has actually shipped to the user, for instance by releasing a new product. Therefore, the correct answer is resolved.

15 l Housekeeping

As your problem database fills up with more and more problem reports over time, you'll want to do some housekeeping. Because as these databases fill up, there are a number of issues that pile up as well. The first one is duplicates. If you have one user who's reporting a problem, chances are that other users will be reporting just the same problem. That is, you have multiple problem reports that all relate to the same class of failures. These problem reports are called duplicates. As a manager, your task is to identify such duplicates. You want to do so in order to avoid them cluttering the statistics, but you also want the duplicates to refer to each other. This way, when you come across a problem report, you will find, hey, this is a duplicate of this original bug report, and all of these others are also duplicates. You'd like to keep the duplicates in your database, though, because all of these may report on different angles of the problem, and these angles may all be helpful for resolving the problem. Note that automatic diagnosis mechanisms, such as statistical debugging or delta debugging, are great tools for identifying duplicates because they'll find commonalities between all the individual bug reports with respect to similar features in the input or in the execution. Next up is obsolete problems. Over time your database will fill up with unresolved problem reports: problems that could not be reproduced, problems that may have been fixed in some later version, and low-priority problems. Having thousands of unresolved problems will drag developers down. They clutter up searches in the database, and they are bad for morale. A problem database that has plenty of obsolete problems is like an overflowing drawer of socks. You don't find the socks you need, and the drawer makes you feel guilty for not throwing away your old socks. What you should do is, over time, simply declare problem reports obsolete and thus get rid of the socks you don't want anymore.
When is a problem obsolete? A problem is obsolete if it will never be fixed. For instance, because the program is no longer supported, or the problem is old and occurred only once, or the problem is old and occurred only internally. You don't want to actually delete these problems, but you can tag them with an appropriate resolution. In Bugzilla, for instance, there is a special WONTFIX resolution for such obsolete problems. Finally, problems are not only stored in the problem database; there may also be test cases which reproduce the exact problem. As a rule of thumb, as soon as you do have a test case that reproduces the problem, the test case makes the problem report obsolete. That is, as soon as you have a test case, you can put a special flag in the problem database saying that the problem is now being addressed by the test.
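
As a rough illustration of duplicate detection, here is a sketch that uses a much simpler heuristic than statistical debugging or delta debugging: it treats two reports as suspected duplicates if the top frames of their stack traces agree. The function name `find_duplicates`, the `depth` parameter, the assumption that lower bug ids were filed earlier, and the sample stack traces are all made up for this example.

```python
from collections import defaultdict


def find_duplicates(reports, depth=2):
    """Map each suspected duplicate's id to the id of the earliest report
    that shares the same top `depth` stack frames."""
    by_signature = defaultdict(list)
    for bug_id, stack in sorted(reports.items()):   # lower id = filed earlier
        by_signature[tuple(stack[:depth])].append(bug_id)

    duplicate_of = {}
    for ids in by_signature.values():
        original = ids[0]
        for bug_id in ids[1:]:
            duplicate_of[bug_id] = original
    return duplicate_of


# bug id -> active functions at the time of the crash, innermost first
reports = {
    101: ["render_page", "open_file", "main"],
    102: ["parse_args", "main"],
    107: ["render_page", "open_file", "handle_drop", "main"],
}
print(find_duplicates(reports))
```

Note that the duplicates are only linked to the original, not deleted, so the different angles they report on stay available.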

16 q Test vs Bug Report

This last point calls for a quiz. Why is it better to have an automated test rather than a problem report? Is it that you can always check whether the problem persists by running the automated test? Is it that you can always reproduce the problem? Is it that if the test fails, you can start debugging right away? Or is it that you can query as much additional information as you need? Check all that apply.

16 s Test vs Bug Report

With an automated test, and this is the main advantage, you can always check whether the problem persists by running the test. An automated test by definition reproduces the problem. If the test fails, yes, you can start debugging right away. And during the run you can query as much information as needed, because you can always reproduce the problem. So, all four apply.
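
Here is what such a test could look like. This is a sketch under assumptions: the function `remove_html_markup` below is a simplified stand-in written for this example, and the test name and the input string are made up to play the role of the steps from a problem report.

```python
def remove_html_markup(s):
    """Strip HTML tags from s, honoring quoted attribute values (stand-in)."""
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == "<" and not quote:
            tag = True
        elif c == ">" and not quote:
            tag = False
        elif c in ('"', "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out


def test_quoted_angle_bracket():
    """Reproduces a hypothetical report: '>' inside a quoted attribute."""
    assert remove_html_markup('<a href=">">foo</a>') == "foo"


# Running the test tells us at any time whether the problem persists.
test_quoted_angle_bracket()
print("problem does not occur")
```

If the assertion ever fails again, the failing input is right there in the test, so debugging can start immediately.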

17 l Defect Maps

So, we have now seen that whenever a user, or a developer for that matter, or anyone, reports a problem in the problem database, eventually a developer or a team of developers will take a look at the problem and make an appropriate fix to the program. Such fixes are also stored in a database, namely a version database where all changes are stored. Such a version database is also called a change database, a repository, a configuration management system, or a version control system. Pick your choice. There are plenty of version control systems around these days which help with storing these changes and the resulting versions. Since a version database is the first thing to use in any kind of civilized software development, I will simply assume that you use such a thing on a daily basis anyway. An interesting thing happens, however, when you link the information from the problem database to the information from the version database. Let's assume that the problem database has a problem report #347 where it says remove_html_markup fails. Let's assume the version database has recorded a change to the function named remove_html_markup in precisely this location, with a comment that this now closes problem report #347, which is a change that may well have been made after the problem was initially submitted. We can now go and relate the change to the actual problem report, because the change message has the actual number of the problem report in it, and we can use that to retrieve the precise problem report. Since we also know where the change has been applied, namely in this part of the file, we now have a link from the problem database to a specific place in the code. This allows us to identify, for every piece of code, the problems that were associated with it. What we do is we take the piece of code, look at all the changes that were made, and look at the problems that these changes refer to.
We can then, for instance, find that remove_html_markup over the history of this very course has had three fixes until it finally worked. Three fixes until a function actually works is pretty bad. We should really worry about the quality of our coding.
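
The linking step can be sketched as follows. This is a minimal illustration, not the tool we built: the function `defect_map`, the `#number` convention in change messages, the change messages themselves, and the problem numbers other than #347 are assumptions for this example.

```python
import re
from collections import defaultdict

# Matches references such as "#347" in a change message.
PROBLEM_REF = re.compile(r"#(\d+)")


def defect_map(changes, problem_ids):
    """Map each changed location to the set of known problem reports
    that its change messages refer to."""
    fixes = defaultdict(set)
    for location, message in changes:
        for ref in PROBLEM_REF.findall(message):
            if int(ref) in problem_ids:      # ignore numbers not in the database
                fixes[location].add(int(ref))
    return dict(fixes)


problem_ids = {347, 352, 360}                # ids in the problem database
changes = [                                  # (location, change message)
    ("remove_html_markup", "Handle quotes inside tags; closes #347"),
    ("remove_html_markup", "Fix #352: single quotes"),
    ("main", "Refactor argument parsing"),
    ("remove_html_markup", "Closes #360"),
]
for location, problems in defect_map(changes, problem_ids).items():
    print(location, len(problems))
```

With real data, the list of changes would come from the log of the version control system, and the set of ids from a query against the problem database.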

18 l Large Defect Maps

The interesting thing is that we can do this for all parts of the program. For every single function in the program, we can look up the changes and find out which problems were addressed in that specific file or in that specific function. What we get this way is a defect count for every single location. That is, the number of problems that have been fixed in that very file or function. In 2007, my students and I built a tool that would create such a mapping from the version databases and problem databases of open source programs. For instance, we would apply this to Firefox, the web browser, in order to find out where the most defects were. More specifically, we would be looking at security defects, that is, problems that relate to security issues. What we would get is precisely the location where the most security bugs would be.

19 q Insecure Package

What you see here is a representation of all the classes in Firefox. Every class in this picture is a rectangle. The larger the rectangle, the more lines of code in that class. These rectangles are nested into folders and packages. The color of the rectangle indicates how many security issues have been fixed in that particular class. What we see here, for instance, is the document object model, which has a fair share of security fixes. But there are more areas with plenty of security issues. Here we have JavaScript. Here we have HTML layout. Here we have a library for displaying the content. Now for the quiz. Which of these four packages has had the most security issues in the past? Is it JavaScript? Is it HTML layout? Is it the DOM? Or is it the content base?

19 s Insecure Package

The answer is JavaScript. Down here you see more than a dozen classes that all have had plenty of security issues. One may wonder why it is that issues end up in a small number of places and that such large parts of the code remain without any issues. For JavaScript this is pretty clear. This piece holds the JavaScript interpreter, which interacts through many, many interfaces with the system. All of these are possible attack vectors. It's no surprise that JavaScript holds this many security issues. Plus, an interpreter is notoriously hard to get right. HTML layout may come as a surprise. After all, this is just the arranging of appropriate user interface elements on the screen. Why would there be security issues related to that? The reason is cross-site scripting. As soon as you lay out multiple sources on one page, it may be that one source tries to access elements of the other source, using your screen as a tunnel. These issues are right here within HTML layout. The document object model also allows accessing and manipulating individual elements. This again is an open door for security issues. As for the content base, I don't really know.

20 q Pareto Principle

We have created such distributions for several systems: at Microsoft, at Google, at SAP, and on many open source programs. What we always found was the so-called Pareto principle, that is, 20% of all modules contain 80% of the bugs. The numbers vary from project to project, but what we always found was that there was a relatively small number of modules that would contain lots and lots of issues. Initially, we were just excited about being able to create such distributions more or less at the touch of a button, but as you look at these distributions you begin to wonder: where do these bugs actually come from? Do the modules that are specifically bug prone have something in common? If they do have something in common, could we use this very feature to make predictions? We dug a bit deeper and checked a number of interesting features. The first question we asked was whether the bug density correlates with the experience of the developers that wrote the programs. That is, possibly more experienced developers make fewer mistakes. For these questions I'm going to ask you for a guess on your side. These will not be rated. So, what do you think? Does bug density correlate with developer experience, yes or no?

20 s Pareto Principle

The answer to that question is yes. It correlates. The more experience, the higher the bug density. This may come as a surprise to you, but here is the story behind it. We mined the Eclipse bug database and checked for the experience and contributions of the individual developers. It turned out that the Eclipse project lead, Erich Gamma, had the second highest defect density in his code across all the developers. Now, Erich Gamma is anything but a nobody. He gave the world unit tests. He gave the world design patterns. And he gave the world the Eclipse programming environment, always with a team of course, but still. Why would his bug density be the second highest across all Eclipse developers? The reason is simple. Suppose this is you. You have been assigned to fix a bug. You look at the problem and you find, oh, this is terribly hard. What you do is you delegate the problem to your boss, who is way more experienced than you. Now your boss is looking at the code and says, "Ahhhhh...this is something I can't handle." He delegates this to his lead, and this guy says, "Ah, this looks really, really, really hard. Only one person in the world can do that." This is Erich Gamma, the team leader, who has no one else to delegate to. Being the team leader, he gets the toughest problems, that is, those problems where the chances of screwing up are highest. Still, he is the man, because anybody else dealing with these problems would, on average, do a worse job than Erich Gamma. This is how the more experienced people get the tougher tasks and possibly introduce more defects, but still overall they are precisely the right persons to do the job.

21 q Predict the Future

If there are lots of bugs in a specific place, will there be more in the future? Remember that if we say we found lots of bugs in one place, we already fixed them. What's your guess?

21 s Predict the Future

It turns out that, yes, if there have been many bugs in one place, it is likely that there will be more bugs in the very same place. This is all the more interesting because all the bugs we see have been fixed, so you'd assume that over time there would be fewer bugs. But that's not the case. With bugs it's like fishing. You find plenty of fish in one place one day. The next day fish will assemble at the same place again, possibly because fish like the same places again and again, except that for fish we have a good theory on how they reproduce. For bugs, we do not. There is a bit of a hypothesis though. The assumption is that the same factors that have contributed to bugs in the past will keep on contributing in the future. Generally speaking, past fixes, and a bit less so past changes, are good predictors for future fixes.

22 q Complexity

Next hypothesis--complexity. Is it so that complex code has more bugs than simple code? Can we use, say, code metrics to predict where the most bugs will be in the future? Pick your choice.

22 s Complexity

Well, the answer is that complexity matters sometimes. Sometimes in a project there is a correlation between some complexity metric and the actual number of bugs. But very frequently there isn't. And in every project it is some other complexity metric that correlates, so it might just as well be random. So, no. In general, complexity is not related to bugs as found in production code.

23 q Well Tested

Next hypothesis--tests. You can measure how well-tested individual parts of your product are. If a piece of code has a high testing coverage, this means that it is well-tested. The question is: is code that is well-tested less buggy? Pick your choice.

23 s Well Tested

The answer to that question is no. Actually, the opposite is true. The more thoroughly a piece of code is tested, the more bugs it is likely to have. The story goes the other way around. Good managers have a good intuition of where the bugs are, and when deciding on where and how to test they go for the locations where most bugs would be suspected. Therefore, in a program like Firefox, lots and lots of testing is being done for JavaScript, for instance. Therefore, this has high coverage, but still many, many bugs are left because testing apparently can't find all the bugs.

24 q Team Structure

The next interesting hypothesis--does the team structure have an influence on how many bugs are produced by that very team?

24 s Team Structure

The answer to this question is yes, or at least that team structure can matter. A very interesting study at Microsoft looked at how teams that were responsible for individual features were composed. More specifically, the study would look at how distant the common manager of a team was. If there was one direct common manager for all members of the team, that distance would be 1, and if their common manager was, say, two levels removed, then the distance would be 2. If the lowest common manager of a team was, say, Steve Ballmer, the manager of Microsoft, then the distance would be 7 or 8 or whatever the number of management levels at Microsoft is. Now, it turned out that the higher this distance was, the more defect prone the modules produced by that team were. The assumption is that if there is no joint management in a team, and every decision first has to go up to the highest level and then back again, then this makes teamwork particularly hard. It also makes it hard to make decisions, and it makes it hard to create some common responsibility for the module you're working on. This study was made on Windows Vista. As a result of that study, Microsoft reorganized the teams for later versions of Windows such that situations like these would no longer occur.
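
To see how such a distance could be computed, here is a small sketch. It is my own reading of the measure described above, not the code of the Microsoft study: the organization chart, the names in it, and the choice to take the longest path when team members sit at different levels are all assumptions.

```python
def chain_of_command(person, manager_of):
    """Return the managers above `person`, nearest first."""
    chain = []
    while person in manager_of:
        person = manager_of[person]
        chain.append(person)
    return chain


def management_distance(team, manager_of):
    """Number of levels from the team up to its lowest common manager.

    Returns None if the team members share no manager at all.
    """
    chains = [chain_of_command(member, manager_of) for member in team]
    for manager in chains[0]:                       # nearest managers first
        if all(manager in chain for chain in chains):
            # If members sit at different levels, take the longest path.
            return max(chain.index(manager) + 1 for chain in chains)
    return None


# Hypothetical organization chart: employee -> direct manager
manager_of = {
    "ann": "lead_a", "bob": "lead_a",
    "cem": "lead_b",
    "lead_a": "director", "lead_b": "director",
}
print(management_distance(["ann", "bob"], manager_of))  # one direct common manager
print(management_distance(["ann", "cem"], manager_of))  # common manager two levels up
```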

25 q Problem Domain

Last hypothesis--is it the problem domain? That is, the domain of the problem that your module is trying to address? Pick your choice.

25 s Problem Domain

The answer to that is a clear yes. In studies of Firefox and Eclipse we found one specific feature of the code that dominated all others. These were the imports made by individual modules. That is, the other modules that the module in question would interact with. More specifically, whatever a module imported would determine its likelihood to have a defect. In Firefox, for instance, if your module included nsIPrivateDOMEvent.h and nsReadableutils.h--that is, used these specific APIs or interacted with these specific APIs--then your code would be doomed. Because the 20 modules that also included these two files all had at least one security issue. Likewise, in Eclipse, if you imported something that dealt with internal features of the compiler, your code would be 4-5 times as error prone as code that only dealt with a graphical user interface. Why is that so? Well, if you write import compiler internal, this means you're going to write some compiler code, and compiler code is more error prone than user interface code, in particular because if you work with a user interface, most errors you make will be immediately visible to the human eye. Whereas if you deal with compiler internals, it's a long path from a bug in the compiler to a bug in the actual compiled program, which then, again, has to be executed in order to have the bug cause a failure. None of this, of course, needs to be discovered right away. All of these are reasons why this domain, namely the compiler, is way more error prone than the user interface.
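
The core of this analysis can be sketched in a few lines: for every import, count how many of the modules using it have had a defect. This is an illustration only; the function `defect_rate_by_import`, the module names, the import names, and the data are made up, and the real studies used more elaborate statistical models.

```python
from collections import defaultdict


def defect_rate_by_import(modules):
    """For each import, return (number of modules importing it,
    fraction of those modules that have had a defect)."""
    total = defaultdict(int)
    defective = defaultdict(int)
    for imports, had_defect in modules.values():
        for name in imports:
            total[name] += 1
            if had_defect:
                defective[name] += 1
    return {name: (total[name], defective[name] / total[name]) for name in total}


# module -> (its imports, whether a defect was ever fixed in it); made-up data
modules = {
    "Parser":  ({"compiler.internal", "util"}, True),
    "Scanner": ({"compiler.internal"}, True),
    "Dialog":  ({"ui", "util"}, False),
    "Wizard":  ({"ui"}, True),
}
rates = defect_rate_by_import(modules)
for name in sorted(rates, key=lambda n: rates[n][1], reverse=True):
    print(name, rates[name])
```

Imports with a high defect rate across many modules are the candidates to use as predictors for new modules that import the same things.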

26 l Recap on Sources

So, we have looked at individual developers and past bugs, at complexity, at tests, at team structure, and at the problem domain. Developers get assigned to tasks that are hard in the first place, and tasks that are hard lead to more bugs. That also means more past bugs and more testing. All of this leads us to the domain as being the most important factor in determining where bugs actually come from. If the domain changes frequently, this will lead to more bugs. If the domain is complex in itself, such as JavaScript or, in Eclipse, the compiler, this will lead to more bugs. If the domain is not well-defined, for instance because the team cannot agree on what to do, then this also leads to more bugs. What we can do, though, is look at past bugs and identify which parts of the domain, and possibly other influences, correlate with them. This may give us a handle on how to avoid such mistakes in the future.

27 l Look at Data

At the end of the day, what the area of mining for such information has found out, though, is that although there may be general rules there are also lots and lots of project specific features that correlate with bug density. Therefore, it is advisable to first take a look at your own defect data and then figure out what the hot spots are and think about possible ways to learn from past mistakes and improve things for the future.
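
A first look at your own defect data could be as simple as checking how concentrated the defects are. The sketch below computes the share of all defects that sits in the most defect-prone 20% of modules, which is the Pareto check mentioned earlier in this lesson. The function name and the defect counts are invented for illustration.

```python
def share_of_defects_in_top_modules(defect_counts, top_fraction=0.2):
    """Fraction of all defects found in the `top_fraction` of modules
    with the highest defect counts."""
    counts = sorted(defect_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total


# module -> number of fixed defects; made-up numbers
defect_counts = {
    "js": 40, "layout": 25, "dom": 8, "content": 4, "net": 2,
    "ui": 1, "prefs": 0, "i18n": 0, "print": 0, "help": 0,
}
hot_spots = sorted(defect_counts, key=defect_counts.get, reverse=True)[:2]
print(hot_spots, share_of_defects_in_top_modules(defect_counts))
```

In this made-up data set, two of ten modules account for 65 of the 80 defects, so these two are where I would look first.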