ud712

Cleaning Up State: Orphan Processes

Now don't worry. This is not your computer misbehaving. How many times have you seen orphan windows like the ones I'm showing you on this screen? Why does this happen? The simple answer: programs are not being hygienic, that is, they are not cleaning up after themselves. Now, you cannot entirely fault the applications themselves. If an application encounters an error, or the user decides to kill an application, either way, the system services that the application was using have to have a way of gracefully terminating and getting rid of the state, that is, the breadcrumbs it created along the way. And it may have strewn such breadcrumbs all over the computer. Some may be in data structures in memory, some may be visible, like the orphan windows that I'm showing you on the screen, and so on. The bottom line is that these breadcrumbs are using up precious resources in the computer. It may be visible resources, like real estate on your display taken up by failed applications, or invisible resources, like memory leaks, network bandwidth being chewed up by connections that persist beyond the life of an application, persistent data structures on the disk, and so on.

LRVM and RioVista (which is really an implementation of the LRVM semantics on top of a persistent file cache) took a very narrow view of recoverability, namely recovering the state that needs to be persistent across system crashes, whether because of software failure or power failure. But imagine a system service that spans several machines; a network file server is a good example. NFS is supposed to be stateless, which means that the client-server interaction maintains no state at the server pertaining to the clients. What that means is that taking down a file server and bringing the file server up again does not need any coordination with the clients that this file server may be serving at any point in time. But look at what is happening from the point of view of the system, especially the client boxes that are interacting with the file server. In this case, there are clients all over the network so far as this file server is concerned. And indirectly, the server is leaving breadcrumbs all over the LAN, in each one of these client boxes, because of accesses that the clients may have made to the file server. Imagine what happens when a client program that initiated a file system call quits in the middle of its exchange with the file server. How will the file server know how to clean up all the mess? In fact, a stateless file server doesn't even know that breadcrumbs have been created on this node. The short answer is that the file server cannot know about these kinds of breadcrumbs that have been created all over the network. And I'm giving you the file server as one example, but in general, all of the system services that a client program relies on create state, and this partial state may live forever if an application crashes in the middle.

Worse yet, you as a user may be asked by the operating system to do some cleanup, not knowing what the cleanup may entail for you personally. Am I going to lose two hours of work that I just completed? I have no idea what will happen when I click the "OK" button here.

Introduction

So Quicksilver asked this question: if recoverability is so critical for so many subsystems, shouldn't recovery be a first-class citizen in the design of the operating system, and not an afterthought? After all, LRVM and RioVista are trying to fix, at the user level, a problem that manifested itself because the operating system was not doing this job itself. So the question that Quicksilver is asking is: can we make recovery a first-class citizen in the design of the operating system?

Quiz Introduction

Users of a system want to have their cake and eat it too. They want good performance from the operating system, but they also want the operating system to be robust and able to recover from failures. That is, they want the operating system to be reliable. Conventional wisdom is that performance and reliability are opposing concerns: you can have one or the other, but not both. Quicksilver's approach is that if recovery is taken seriously from the get-go, you can design the system to be robust to failures without losing much on performance. Let's have another trivia quiz.

Quicksilver

So from the introduction that I gave you about Quicksilver, you can see that it has identified many problems that you and I face in our everyday computing, with orphaned windows, memory leaks, and so on. So the question to you is: name the company and the year in which Quicksilver was built. The choices I'm giving you in terms of companies are IBM, Sun, HP, Apple, Microsoft, and Google. And the choices of years I'm giving you are the early part of the '80s, the early part of the '90s, the early part of the 2000s, or 2013. So pick your choice for the company and the year in which the Quicksilver system was built.

Quicksilver

This might come as a surprise to you, because these are problems that we are facing even today, but Quicksilver was done by IBM back in the early '80s. To be precise, 1984 was the year in which the Quicksilver project was started at IBM.

Distributed System Structure

If you look at the structure of distributed systems today, what you'll see is that there are the applications that users use, there are system services, and there is a microkernel sitting at the bottom. The system services include things like the file server, web server, window manager, database manager, and the network stack for applications to do network communication with servers that may be on remote machines. The microkernel itself is responsible for process management, managing the hardware resources, and providing interprocess communication, both intra-machine among these services and inter-machine, through the network stack, with remote machines on the local area network. We've seen that this is a structure we like to see in an operating system. In fact, from many of the previous lessons that we have looked at on operating system design and implementation, for a single processor, for a multiprocessor, and for a networked operating system, this kind of structure lends itself to extensibility and yet high performance.

Quicksilver System Architecture

As I mentioned earlier, Quicksilver was done in the early '80s, and here is a sketch reproduced from their paper. An interesting thing you'll notice is that this sketch is very similar to what I showed you as the current structure of network operating systems in the earlier panel. It's a microkernel-based design, and Quicksilver was of the same vintage as Mach from CMU. Having seen the structure of operating systems in earlier lessons, this picture should be very familiar to you: all the system services are provided above the microkernel. The microkernel is only responsible for process management, IPC, and machine control, and all of the system services, such as the window manager, file system, virtual memory, and communication, sit above the microkernel, implemented as server processes.

Now, the late '80s were an exciting time for distributed systems. There were no laptops at that time. We were moving from CRT terminals, cathode ray terminals connected to a mainframe, to the era of office workstations, what we routinely call desktops these days. Quicksilver was conceived as a workstation operating system, and the ideas enshrined in the Quicksilver operating system predate, or are concurrent with, many things that we take for granted today, such as network file systems, remote procedure call, the Internet, the World Wide Web, and so on. What set Quicksilver apart from the competition at that time were questions such as how new services, like a window manager for managing the real estate on the workstation screen, should be integrated into the operating system as a whole. Some services may not be available on the workstation itself; for example, a file server may be remote, so integrating communication into the design of the operating system was key. And services, both within a workstation and across workstations on the network, need to recover from failures in the distributed system. Rather than an ad hoc mechanism for each server to recover from such failures, Quicksilver was the first to propose transactions as a unifying concept for recovery management of the servers. It was the first operating system to propose transactions in the operating system, and that is the reason you see a transaction manager as part of the services provided by the operating system.

A quick personal note: I was a graduate student at the University of Wisconsin-Madison in the early '80s, and spent a year at IBM Research working with the group that conceived and designed this Quicksilver operating system. My dissertation research, which was on hardware support for interprocess communication, used as an experimental platform a precursor to the Quicksilver operating system called 925. Since Quicksilver was intended as a workstation operating system, 925, as you can imagine, is a pun on the office workstation: nine to five. That was the name given internally to the precursor of the Quicksilver operating system.

IPC Fundamental to System Services

Because Quicksilver is a distributed operating system, IPC, both within a machine and across the local area network, is a crucial component of Quicksilver. This picture shows the semantics of the IPC call. In the kernel, there is a data structure called a service queue, which is created by a server that wants to service requests from clients. A client makes a request, and the kernel does an upcall to the server to indicate that a client request has arrived. The server executes the upcall associated with this particular request. When it completes the request, the completion goes back into the service queue, and that is an indication for the kernel to send a response back to the client. So this is the synchronous client call, where the client waits until the request is actually serviced and the completion response comes back to the client. The service queue is a global service queue, just like a UNIX socket. Any process anywhere in the network that has knowledge of the service queue can connect to it and make requests on it, and similarly, any server process in the entire distributed system can service requests that come into the service queue. There are some fundamental guarantees provided by Quicksilver for interprocess communication, which include no loss and no duplication of requests: once a request comes in, it will get done exactly once. Quicksilver also takes care of the reliability of the data transfer that is inherent when the client and the server are on remote machines. And because the service queue data structure is globally unique for every such service, there is location transparency for client-server interactions; in other words, a client does not need to know where in the network its particular request is being serviced. That is yet another feature of the IPC guarantee. The IPC call can also be asynchronous. What that means is that the client can make a request asynchronously and continue with its own execution, whatever it wants to do; it doesn't have to block on this. The kernel is going to take the same action, that is, if there is a server available, the kernel is going to pass the request to that server to execute. When the completion comes back, it is buffered in the service queue by the kernel, waiting for the client to come back and ask for the response. So the client, at some point, has to do a wait on the service queue to indicate that it is ready to receive the response that may have come back for the request it made earlier. When the client does the wait, if the original request has already been serviced by the server and the response is sitting in the service queue, then the kernel will deliver the response to the client. If not, the client will wait until the response comes back. So this is the asynchronous client call, but in either case, as I mentioned earlier, the IPC guarantees hold: there is no loss of the request, and there is no duplication of the request. As you can see from the semantics I described just now, Quicksilver IPC is very similar to remote procedure call. In fact, the remote procedure call paradigm was invented around the same time as the Quicksilver operating system. And since all services are implemented as server processes, IPC is fundamental to Quicksilver.
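
To make these semantics concrete, here is a minimal sketch, in C, of what a Quicksilver-style service-queue interface might look like. All of the names here (sq_create, sq_offer, sq_request, and so on) are hypothetical, invented purely for illustration; the lecture does not give the actual Quicksilver system-call interface.

    /* Hypothetical sketch of a Quicksilver-style service-queue API.
       The names are illustrative; the real Quicksilver interface differs. */

    typedef struct service_queue service_queue_t;   /* opaque, kernel-managed */

    /* Server side: create a globally named queue; any number of servers
       may then offer their services on it, and the kernel upcalls one of
       them (based on load) when a request arrives. */
    service_queue_t *sq_create(const char *global_name);
    int sq_offer(service_queue_t *q,
                 void (*upcall)(const void *req, void *resp));

    /* Client side, synchronous: block until some server completes the
       request. The kernel guarantees exactly-once execution: no loss,
       no duplication, and location transparency (the client never
       learns which node serviced the request). */
    int sq_request(service_queue_t *q, const void *req, void *resp);

    /* Client side, asynchronous: issue the request and keep computing;
       the kernel buffers the completion in the queue until the client
       does a wait. sq_wait blocks only if the response is not in yet. */
    int sq_request_async(service_queue_t *q, const void *req);
    int sq_wait(service_queue_t *q, void *resp);
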
The IPC semantics supported by Quicksilver allow multiple servers to wait on a service queue. The way they do that is by making a call named offer, which is essentially saying, I'm willing to offer my services for this particular service queue. Any number of servers can make this offer, and that essentially means that when a request comes in, any one of these servers can be called by the kernel, depending on the busyness of the servers with respect to handling requests that have come in for the service queue in the past. The client-server relationship is also interchangeable. For example, a client can make a call on a file system server, and the file system server in turn makes a call to a directory server and a call to a data server. In this case, the file system becomes a client to the directory server and the data server. So in that sense, the client-server relationship is interchangeable. Now, the only reason for me to spend some time describing the IPC semantics of Quicksilver is that the recovery mechanism is tied intimately to the IPC. In fact, that's how you can have your cake and eat it too in Quicksilver. In other words, the client-server interactions have to use IPC, so the recovery mechanism, using transactions, rides on top of the IPC, essentially bundling the recovery mechanism with IPC to get it cheaply. Another interesting footnote I wanted to mention: the Quicksilver system was first conceived in the early '80s, but the first paper that described it appeared in 1988. This is certainly a difference between academic research and industrial research, at least in the olden days. In academic research, we tend to shout early and often; I'm an academic myself, so I take part of the blame. At least in the olden days, industrial research used to take the approach of publishing a paper, especially in systems design, only when it was fully cooked. Like I said, Quicksilver was designed and implemented in the early '80s, from 1984 to '88, but the first paper came out in 1988. Nowadays, I have to mention, everyone is shouting early and often, which explains the proliferation of conferences that you see around the country and the world.

Bundling Distributed IPC and Transactions

Let's understand how Quicksilver bundles IPC with recovery management. The secret sauce for recovery management is the notion of a transaction. It's a lightweight version of a transaction, not the heavyweight version that you normally associate with databases. In that sense, the notion of a transaction in Quicksilver is very similar to what we saw in LRVM. In fact, it is perhaps fair to say that LRVM inherits semantics similar to the transactions proposed in Quicksilver, because Quicksilver predates LRVM. The IPC calls are tagged with a transaction ID. Let's say a client makes an IPC call to a server. Now, because of location transparency, the client and the server can be on different machines in the local area network. Under the covers, the client contacts the Quicksilver kernel on the node that it is on, let's say node A, which in turn contacts the communication manager that is part of the Quicksilver operating system, implemented as a server process above the kernel. The communication manager on node A then contacts the communication manager on node B via the kernel on node B. All of this is happening under the covers when a client makes an IPC call to the server. Depending on the nature of the client-server relationship, there is going to be state associated with this client-server interaction. Further, the communication manager may itself have state when it is communicating with its peer on a different node of the network. We would like to make sure that the state associated with the communication managers, as well as the state associated with the high-level client-server relationship that I am showing you, is recoverable from failures. The failures may be things like a link failure or a crash of the server. Any of these will result in state being left behind, and what we would like to ensure is that all such state left behind in the entire distributed system is recoverable. So, under the covers, the communication manager on node A contacts its transaction manager, and the transaction manager on node A in turn contacts its peer on node B. A transaction link is now established as a way of recording the trail of client-server interactions, to later facilitate picking up the breadcrumbs that may be left behind when the client-server interaction terminates, either successfully or unsuccessfully. The creator of the transaction is the default owner of the transaction, that is, the root of the transaction T; the others are participants. In this example, I'm just showing you that a client-server relationship exists here. The transaction manager contacts its peer, and the peer says, yes, I'm willing to participate in this transaction. This is the transaction tree that gets established as a by-product of the original client-server interaction. The owner, that is, the root of the transaction, is the coordinator for the transaction tree that gets established. As we will see shortly, this transaction tree can span several sites, because a transaction that starts here goes to the server and may go on to other servers. So the creator of the transaction is the owner of the transaction. But the owner can change ownership, as well as change the coordinator for this transaction, and we'll see how that is useful in a minute. The clients and servers can remain completely unaware of transactions if they so choose.
In other words, the mechanisms are provided by the Quicksilver operating system, but it is up to each service provider whether or not to use them. And there is no extra overhead for the communication that I am showing you between the transaction managers on these different nodes, because it happens naturally as part of the IPC that has to happen anyway in the distributed system. That is, the communication needed for these transaction managers to handshake, to say yes, I've received your transaction request and I want to participate in it, is piggybacked on the normal communication that happens to support the interprocess communication between the client and the server, through the communication managers on these different nodes. They are piggybacked on top of regular IPC, so there is no extra overhead for the communication among the transaction managers. A chain of client-server interactions thus leads to a transaction tree, since client-server interactions can span multiple nodes, or multiple sites, on the local area network. There is the root of the transaction tree, the owner that initiated the original transaction through the IPC call, and there are all the participants who said, yes, we are also part of this IPC chain, and we will participate in this transaction that was initiated by this owner. Here are examples of how client-server interactions lead to the creation of transaction trees. You can have a client making a call to a window manager, asking for something to be painted on the screen; under the covers, that will result in a transaction link between these two nodes. Or a client could make a request to a file server for opening a file; that would result in a transaction tree getting established between the participating nodes. And you can see in these examples what kinds of breadcrumbs will get created on behalf of the client. The window manager may have opened up a window on the display, and that's a breadcrumb created on behalf of this client. Similarly, the file server may have opened a file and kept some pointers to where this client is in that particular file; that's a breadcrumb the file server is creating on behalf of this client. Those are the states that we would like to be recoverable if in fact there is any failure. I mentioned earlier that IPC calls are tagged with the transaction ID that gets created under the covers. These transaction IDs are automatic; the clients and the servers don't have to do anything special for that. But at the same time, they don't have to care about them either, if they don't want to use them in any fashion or form. The key point is that transactions are provided in the operating system for recovery. And because a client-server relationship can traverse multiple nodes or sites in the local area network, recoverability has to worry about multi-site atomicity. A transaction that originated on one node may traverse several different nodes and leave breadcrumbs all over the place. All of those have to be cleaned up when a transaction terminates. So there is coordination that needs to be done, and this is where the transaction tree is very useful. And since the transaction mechanism that is used as the secret sauce in Quicksilver is purely for recovery management, the semantics are very simple, and there is no worry about concurrency control either.
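
As a rough illustration of the bookkeeping involved, here is a small sketch, in C, of the per-node state a Quicksilver-style transaction manager might keep for one transaction tree. The field and function names are invented for this sketch and are not taken from the Quicksilver paper.

    #include <stdint.h>

    typedef uint64_t tid_t;        /* transaction ID tagged onto every IPC call */
    typedef uint32_t node_id_t;    /* site (machine) identifier */

    /* One node's view of a transaction tree (hypothetical layout). */
    typedef struct tx_node {
        tid_t     tid;             /* created automatically, under the covers   */
        node_id_t owner;           /* creator; default coordinator, but both
                                      ownership and coordination can be moved   */
        node_id_t parent;          /* the site whose IPC pulled us into the tree */
        node_id_t children[8];     /* sites our own IPC calls fanned out to     */
        int       n_children;
    } tx_node_t;

    /* Enlisting a new participant: the handshake between transaction
       managers is piggybacked on the client-server IPC that is happening
       anyway, so growing the tree costs no extra messages. */
    void tm_enlist(tx_node_t *t, node_id_t new_child)
    {
        if (t->n_children < 8)
            t->children[t->n_children++] = new_child;
    }
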

Distributed Transaction

By their very nature, transactions are distributed. At each node, the transaction manager is responsible for all the client-server interactions that touch that particular node. For example, the client at this node could have done the following: it could have opened a window for reading Gmail, and it could have opened another window in which it is accessing a file for editing. Now, the transaction manager at node A has to manage the transaction trees corresponding to each of the client-server interactions initiated by this client. Opening a window for Gmail is one interaction; opening another window that is accessing a file for editing is another interaction. The transaction manager is responsible for maintaining the transaction trees for each one of those separately. The transaction manager where the client-server interaction originates is the owner as well as the coordinator, but these roles can be delegated to another node, as I mentioned already. And there is a graph structure to the transaction tree. This is very useful in terms of reducing network communication, because all the transaction managers that form this interaction don't always have to communicate with the coordinator. For instance, the transaction managers at nodes C and D have been contacted because of IPC that originated at node B, going to nodes C and D respectively. In that case, they only have to report to node B; they don't have to report to the coordinator of the whole transaction tree. So this is helpful in reducing the amount of network communication. All transaction managers are not equal. As I mentioned earlier, the brittle nodes in the system are the client nodes, and therefore it is possible that a transaction manager at the client node where a transaction originated may designate the coordinator to be a more robust node, like the file server. And there are different kinds of failures that can happen: there could be a failure of a participant node in the transaction, there could be a connection failure, or one of the subordinate transaction managers could fail to report. All of these are sources of failure. Now, one of the things that the transaction manager at each node has to do is to log periodically, to persistent store, the state that is created on behalf of the clients and servers at that node. Whatever is happening at that node, its transaction manager is responsible for creating checkpoint records for recoverability reasons, and these checkpoint records will be useful for warding off failures or for partial recovery of work. In a distributed system, failures can happen at any point. If, for instance, this node fails, then the transaction manager at this node has also failed, and this is something that this other transaction manager is going to find out about, because it doesn't hear any response from the failed transaction manager. But the transaction represented by this graph is not aborted at the first indication of failure of a node, the reason being that you don't want error recovery to start at the first sign of a failure. You want the transaction to be aborted only upon termination, as requested by the transaction manager that is the coordinator of the transaction tree. The reason, as I said, is that partial failures may have left state behind; you want to clean up all of that, and that will happen when the coordinator initiates termination of the transaction. We'll see that in a minute.
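
Here is a minimal sketch, in C, of the periodic checkpointing just described. The record format and function name are hypothetical; the point is only that each transaction manager logs, to persistent store, the state created at its node, so that cleanup or partial recovery is possible when the coordinator eventually terminates the transaction.

    #include <stdio.h>

    /* Hypothetical checkpoint record: which transaction created what
       state at this node (e.g. "window 42 open", "fd for project.txt"). */
    typedef struct {
        unsigned long long tid;
        const char        *state_desc;
    } checkpoint_rec_t;

    /* Periodically append a checkpoint record to this node's persistent
       log. Note that a participant's failure does not abort the tree
       immediately; these records are what let every site clean up its
       breadcrumbs when the coordinator finally initiates termination. */
    static void tm_checkpoint(FILE *log, const checkpoint_rec_t *r)
    {
        fprintf(log, "CKPT tid=%llu state=%s\n", r->tid, r->state_desc);
        fflush(log);   /* force to storage; see the log-force discussion later */
    }
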

Commit Initiated by Coordinator

When the client-server relationship that resulted in the formation of a transaction tree completes its action, for example, let's say the transaction tree got created as a result of a client opening a file, doing a bunch of reads and writes on the file, and finally closing the file, at that point the interaction between the client and the server is complete. That's when the transaction tree that has been created as a shadow of the original client-server relationship gets into gear and says, okay, now it is time to clean up any resources that may have been created along the way in order to satisfy that original client-server relationship. It is the coordinator that initiates the termination of a transaction, and the termination initiated by the coordinator can be a commit, or it can be an abort. The different color-coded arrows show you the kinds of commands that the coordinator may issue to its subordinates, who in turn will issue the same commands to their subordinates. If the transaction manager that is the coordinator decides that it is time to commit, it might send a vote request on commit, or it might send an abort request, or it might send an end of commit or abort. These are all the communications that would be initiated by the coordinator of the transaction, and correspondingly, responses will come back up the tree, commensurate with the request that was sent down. Using the file service as an example again, if a client that started a file service request crashed for some reason, then the transaction manager that is the coordinator for that client-server relationship will send an abort request to all the participating transaction managers that got touched by that particular client as a result of that file service request. For instance, this could be a directory server, or the data servers that are hosting those files, and so on. All those transaction managers will get contacted by the coordinator through this transaction tree with a request to abort that particular transaction ID. When they get that request, these transaction managers can do local cleanup, whatever that might mean, and indicate to the coordinating transaction manager by a response that yes, we have done what needs to be done to clean up the interaction that was started by that failed IPC chain. All the different kinds of communication that I've indicated here are opportunities to tailor the commit protocols, depending on the criticality of the state, that is, the nature of the breadcrumbs that are going to be left behind in the different servers. So, for instance, suppose the transaction tree represents a client-server relationship between a client and the window manager, where the client opened a window on the screen. If the client crashes, then the transaction manager that was the coordinator for that particular client will indicate to the transaction manager at the window manager's site that this particular transaction ID is aborting. In that case, the window manager can simply clean up the state, because the state it holds is volatile state, namely, a window that it created on behalf of the client, so it can take care of it internally; it need not be persisted. On the other hand, if the interaction we're talking about involves persistent data structures, for instance through a file server, then the file server may have to take more complicated action. That is what this transaction tree allows you to do, depending on the nature of the client-server relationship.
You can choose different commit protocols depending on the criticality of the state involved and the nature of the breadcrumbs that may be left behind at the different sites because of the client-server relationships. Persistent servers, such as a file system, may need sophisticated commit processing, such as a two-phase commit protocol, while a window manager will only need a simple one-phase commit protocol. All of those are possible with the structure of the transaction tree and the kinds of requests that flow through the transaction tree, both down the tree and, in response, back up the tree. This is the heavy lifting done by the operating system in order to support the different classes of services that might exist on top of it, which may require different recovery management.
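
To illustrate the idea of tailoring the commit protocol to the criticality of the state, here is a small sketch in C. A coordinator might run full two-phase commit only for participants holding persistent state (like a file server), while participants holding only volatile state (like a window manager) are simply told the outcome in a single phase. The types and function names are invented for this sketch, not taken from Quicksilver.

    typedef enum { VOLATILE_STATE, PERSISTENT_STATE } criticality_t;
    typedef enum { TX_COMMIT, TX_ABORT } outcome_t;

    /* Messaging stubs, assumed to ride down the transaction tree. */
    int  send_vote_request(int participant);          /* returns 1 = "yes" */
    void send_outcome(int participant, outcome_t o);  /* one-way notification */

    outcome_t coordinator_terminate(const int *parts,
                                    const criticality_t *crit,
                                    int n, outcome_t intended)
    {
        outcome_t decision = intended;

        /* Phase 1: only participants with persistent state are asked to
           vote; a single "no" vote forces an abort. */
        for (int i = 0; i < n; i++)
            if (crit[i] == PERSISTENT_STATE && !send_vote_request(parts[i]))
                decision = TX_ABORT;

        /* Phase 2 (the only phase for volatile-state participants such as
           a window manager): propagate the decision down the tree; each
           participant does its local cleanup and replies back up. */
        for (int i = 0; i < n; i++)
            send_outcome(parts[i], decision);

        return decision;
    }
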

Upshot of Bundling IPC and Recovery

The upshot of bundling IPC and recovery management is that a service can safely collect all the breadcrumbs it left behind in all the places it touched through the course of its service. Examples of such breadcrumbs include memory that was allocated but not reclaimed, file handles, communication handles, or orphan windows that have been created on the display. All of these are breadcrumbs that may be left behind by failed clients or servers. The fact that you have a transaction tree that records the trail of all the nodes that were touched, and the temporary state that was created at all these nodes, allows these breadcrumbs to be cleanly reclaimed. And there is no extra communication for the recovery management itself, because whatever communication is needed for the transaction managers to talk to one another, in participating in the transaction that is shadowing the IPC, rides on the IPC itself, and therefore it comes for free. Quicksilver only provides the mechanisms; the policy is entirely up to each service. In fact, services can simply ignore these mechanisms if they don't need any recovery management, or choose a policy commensurate with the type of service they provide. Because there is a variety of mechanisms available in the operating system, simple services may choose low-overhead mechanisms, whereas a more involved service, such as a file server, may use weightier mechanisms in order to recover from failures. The overhead for recovery management in Quicksilver is very similar to what we already saw in LRVM. The transaction managers have to write log records at every one of these nodes, because they are handling the interactions of the clients and servers at their node with other sites; the transaction managers write these log records for recovering persistent state. Similar to LRVM, Quicksilver also uses in-memory logs in the transaction managers, and the transaction managers periodically flush the in-memory logs that record persistent state to disk. How often these in-memory log records are written out to the disk by the transaction manager is a performance factor, and it is a window of vulnerability, just as in LRVM. Ideally, if you want to be absolutely sure that you can recover from failures, whenever any persistent state is modified and a log record is written by the transaction manager, that log record should be committed to storage. But that costs a synchronous I/O, and so, if you're worried about performance, you might want to do that more lazily, assuming there are not going to be too many failures. So there is a performance-versus-vulnerability trade-off in how often the transaction manager writes log records to the storage device.

Implementation Notes

One of the key aspects of the Quicksilver implementation is log maintenance. The transaction managers write the log records needed for recovering persistent state. The records are written into an in-memory data structure of the transaction manager, and every so often, the transaction manager does a log force of the in-memory log segment to storage, for persistence. The frequency of log force impacts performance, of course, because log forces are synchronous I/O operations. A log force can also be initiated by an application if it is concerned about its persistent state getting onto storage. But in Quicksilver, an application has to be very careful about how often it does a log force, because log maintenance is done by the transaction manager at a site on behalf of all of the processes running at that node. The log in Quicksilver actually contains the modifications to persistent state required by all the processes running at that node. Therefore, if an individual client at a node decides to do a log force, it is actually impacting the performance not only of that particular client, but of all the other clients as well. So services have to be very careful in choosing, from the mechanisms available in the operating system, the ones commensurate with their recovery requirements.
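
The following sketch, in C, shows why an application-initiated log force is so expensive in this design: one in-memory log segment is shared by all the processes at a node, so forcing it flushes everyone's records synchronously. The segment size and function names are invented for the illustration.

    #include <stdio.h>
    #include <string.h>

    #define LOG_SEG_SIZE (1 << 16)

    static char   log_seg[LOG_SEG_SIZE];  /* ONE in-memory log for the node:
                                             records from every process here */
    static size_t log_used;

    /* Appending is cheap: just a memory copy into the shared segment. */
    void log_append(const void *rec, size_t len)
    {
        if (log_used + len <= LOG_SEG_SIZE) {
            memcpy(log_seg + log_used, rec, len);
            log_used += len;
        }
    }

    /* A log force writes the WHOLE segment synchronously, i.e. every
       process's pending records, not just the caller's. One client
       forcing often therefore slows down all the other clients too. */
    void log_force(FILE *stable_storage)
    {
        fwrite(log_seg, 1, log_used, stable_storage);
        fflush(stable_storage);   /* the expensive synchronous I/O */
        log_used = 0;
    }
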

Conclusion

The intent in this lesson is to give you a feel for how some enduring concepts stand the test of time. The ideas in Quicksilver, namely, using transactions as a fundamental operating system mechanism for bundling recovery of operating system services with IPC, found resurgence in the '90s in the LRVM work that we discussed earlier for providing persistence. Again, around 2010, they found resurgence as a way of providing safeguards against system vulnerabilities and malicious attacks, in another research operating system called TxOS. We will mention TxOS when we cover a later lesson module on system security. What about the computer industry and commercial operating systems? Well, they have always focused on performance; reliability always takes a back seat. You would be amazed at what goes on under the covers of an operating system to gain performance. Just as an example: you write to a file, and you think it is on the hard disk. Well, think again. It is a while before your file write actually gets persisted on the disk. What if the system crashes in the meanwhile? Well, too bad. Things may change in the future, though. There are new kinds of memories, called storage class memories, that have latency properties similar to DRAM and yet are non-volatile. Will this new technology result in a resurgence of exploring transactions once more in operating systems? Only time will tell.