Hello and welcome back to the next module of the operating systems course. This module is on system recovery. System crashes can happen due to power failures, hardware failures, and software failures. As operating system designers, it is essential to understand how to build systems that can survive crashes. In this module, we will discuss technologies that help us deal with system failures and recover from them. We will cover three systems. The first, LRVM, suggests providing a persistent virtual memory layer in support of system services. The second, RioVista, shows how such a persistent layer can be implemented in a performance-conscious way, with some ingenuity. The third, Quicksilver, suggests a more radical approach: making recovery a first-class citizen in the design of the operating system.


The first question that should pop into your mind whenever we suggest a solution approach is: why? In particular, whose pain point are we trying to solve? It turns out that many operating system subsystems need persistence. Take the file system as an example: it has metadata such as inodes that say where and how files are actually stored on the disk. The inode data structure is a persistent data structure, and while the operating system may bring it into main memory and manipulate it there for performance reasons, it eventually has to put it back on the disk in a persistent state once permanent changes have been made to it. That is why persistence is so important for building such subsystems. Similarly, the runtime systems of many languages may require support for persistent objects that need to be stored on permanent storage. All of these subsystems, to make processing efficient, keep an in-memory cached copy of the persistent data that lives on permanent storage such as the disk. But when changes are made to those in-memory data structures, the changes have to be committed back to permanent storage so that the actions taken by these subsystems remain consistent.

So how can we provide persistence support for these subsystems? One possibility is to make virtual memory persistent, because if the virtual memory system is persistent, then all the data structures contained in virtual memory automatically become persistent. The upshot is that the subsystems no longer have to worry about flushing persistent data structures from memory to the disk; some other layer in the software stack takes care of it. It also means that if the system crashes, either due to power failure or due to software bugs, recovery becomes easy, because anything that was done in the virtual memory of the process representing that subsystem is available in persistent storage.

The question then is, who will use it? We expect subsystem designers to use such an abstraction if we provide it. But subsystem designers also care about performance, and they will use the abstraction only if it is performant. In other words, it is not enough to come up with an abstraction; it is also important to make the abstraction efficient. If an abstraction is cheap to use, easy to understand, and flexible, subsystem designers will use it; if it is not, they won't. So the question is: if we want to make virtual memory persistent, can we also make the implementation of this persistent virtual memory abstraction efficient?

Now, how can we make it efficient? Consider the virtual address space of a process that belongs to a particular subsystem. It may have persistent data structures strewn all over this address space.
If these persistent data structures are manipulated by the subsystem and we want the changes committed to storage, then every time we touch the in-memory version of a persistent data structure and modify it, the on-disk version of that same data structure needs to be updated. If those data structures are spread out all over the virtual address space, that results in many I/O operations, first of all; second, it means we may be writing to many different portions of the disk. Both of these are bad news, because the disk is electromechanical in nature, and as a result there are latencies associated with it, namely seek latency and rotational latency. If the writes we are doing to virtual memory, which is backed by the persistent disk, end up scattered all over the disk, the result is many random writes to the storage system, and random writes mean very poor performance.

So instead, what we would like to do is keep what we will call a log segment. The idea is very similar to the log-structured file system we saw earlier when we discussed xFS. Every time you write to a persistent data structure in memory, you also write a log record that corresponds to the change you made to that data structure: this is the change made to this data structure, this is the change made to that one, and so on. The log segment is itself a data structure that the implementation of persistent virtual memory uses to record changes to the persistent portions of the virtual address space. And because we are recording the changes in this log segment, we can write them contiguously. Of course, we still have to commit these changes to the disk, but now you can see the difference between the two pictures: before, we were doing random writes all over the disk; here, we are writing the changes to contiguous portions of a log segment, and we can store that log segment contiguously on the disk. That is the attraction.

So even though the subsystem is making random writes within its address space, because those changes are being recorded as records in a log segment, and we commit the log segment to the disk, we convert random writes into sequential writes on the disk. We accomplish two things by doing that. First, we are not issuing an individual I/O operation every time we touch a different portion of the virtual address space, because the changes go into the log segment, which may be an in-memory data structure; that reduces the number of I/O operations needed to commit the changes to storage. Second, because we write the log contiguously, we avoid the random writes we talked about.
All of this helps make the design and implementation of a persistent virtual memory much more efficient. That is the key idea in the paper I've assigned to you for reading, the LRVM paper.
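To make the idea concrete, here is a minimal sketch in C of what aggregating changes into an in-memory log segment might look like. The struct layout, names (log_record, log_append), and fixed sizes are illustrative assumptions for this lecture, not the actual LRVM data structures.

```c
#include <string.h>
#include <stddef.h>

/* One change to a persistent region: where it starts, how long it is,
   and a copy of the new bytes. (Illustrative layout, not LRVM's.) */
typedef struct {
    void   *base;                 /* virtual address that was modified */
    size_t  nbytes;               /* length of the modified range      */
    unsigned char data[256];      /* new contents of that range        */
} log_record;

/* An in-memory log segment: change records accumulate here contiguously,
   then the whole segment is written to disk with one sequential write. */
typedef struct {
    log_record records[1024];
    size_t     count;
} log_segment;

/* Instead of issuing a disk write for every modified data structure,
   append a record describing the change to the log segment. */
int log_append(log_segment *log, void *base, size_t nbytes)
{
    if (log->count >= 1024 || nbytes > sizeof log->records[0].data)
        return -1;                     /* segment full or range too big */
    log_record *r = &log->records[log->count++];
    r->base   = base;
    r->nbytes = nbytes;
    memcpy(r->data, base, nbytes);     /* snapshot the new value        */
    return 0;
}
```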

Server Design

Now let's discuss how we would design a server that requires persistence support using this idea of a persistent virtual memory. This is the virtual address space of the server. Note that not the entire address space of the server needs to be persistent, because the developer, the subsystem designer, knows which data structures in the design need persistence. Take a concrete example like a file server: the inode data structures, which live on the disk, are the things the file system designer would want to make persistent. So if an inode data structure, let's say M1, is mapped into a portion of the virtual address space, then it is the manipulation of M1, in particular updates to M1, that needs to be reflected in the backing store. These data structures M1, M2, and so on are in-memory versions of persistent data structures that live on the disk. We will call a collection of data structures that need to be persistent on the disk a data segment, and an application may choose to use multiple data segments, corresponding to the persistent objects it needs to manipulate in the course of its execution. So M1, M2, and so on are persistent metadata that have an on-disk copy as well as an in-memory version, whereas everything else in the virtual address space of the server is normal code and data.

In terms of flexibility for server design, it makes sense to allow an application to create as many data segments as it needs. So what we need to do is allow an application to map an external data segment. By external data segment we mean a data segment that is external to the virtual memory, living on the persistent medium, in this case a disk, and the application maps this external data segment to selected portions of its address space. What we help the application do is explicitly map regions of the virtual address space to data segments that live on the disk. As I said, to allow flexibility, the application designer may choose to have multiple external data segments and map these different data segments to different portions of the virtual address space. In other words, the application completely manages its own persistence needs, and all we are providing is the ability to specify external data segments to back the persistent data structures it is going to manipulate in memory. At startup, the application maps these external data segments to selected portions of its virtual address space to create the in-memory versions of the data structures it needs to manipulate during its execution. The mapping between the virtual address space and the external data segments is one-to-one; that is, there is no overlap of external data segments in terms of the regions they occupy within the virtual address space, and this makes the design of a recoverable virtual memory that much simpler. Simplicity is the key in the design, so that it is easy to use, flexible, and performant; those are the goals the authors of recoverable virtual memory set out to accomplish. And just as mapping is done at startup, at any point in time the application can choose to unmap a portion of the virtual address space that is mapped to an external data segment.
We will see later on that an opportune moment for doing this unmapping is when no commits are pending for these in-memory data structures and their external representation in the data segments living on the disk.

RVM Primitives

Now we'll discuss the primitives that the recoverable virtual memory library provides for the application developer. Recall what I said earlier: we do not want the in-memory data structures, when modified, to immediately start writing to the corresponding on-disk versions in the external data segments, because that would result in a lot of random writes. Therefore we suggested using a log segment to aggregate the changes we are making to the persistent portions of the virtual address space, so that the log segment can then be committed to the disk to record those changes.

So the first set of primitives RVM provides is for initialization. In particular, the initialize primitive identifies the log segment to be used by the server process for recording changes to its persistent data structures, and every process can declare its own log segment for managing its persistence. Remember that RVM is not inside the operating system; RVM is provided as a runtime library, in support of applications, that lives on top of the operating system. So the library provides these primitives, and initialize allows the process using the library to declare a log segment, the data structure into which RVM, in the course of execution of the process, will aggregate the changes this process makes to persistent data structures, so that later on those changes can be committed to the disk and eventually reflected in the data segments that the in-memory versions of the persistent data structures represent.

The map primitive is what allows the application to say which region of its virtual address space it wants mapped to an external data segment. I mentioned that there is a one-to-one correspondence between an address range in the virtual address space and an external data segment; so if I need to map different portions of the process's address space to different data segments, I execute multiple map calls. The region descriptor passed to map names both the address range to be mapped and the external data segment that corresponds to that address range. Unmap does the reverse: it decouples the address range from the external data segment it is associated with.

In the body of the server code, these are the calls an app developer would make. The begin transaction and end transaction calls alert the RVM runtime that the application is about to make changes to persistent data structures between these two calls. End transaction signals to the RVM library that the transaction has committed, meaning that all the changes the application made between begin and end to persistent data structures have to be flushed to the disk; that is, they have to be made persistent. On the other hand, a begin transaction could also end in an abort transaction, which signals to the RVM library that all the changes the application made between begin transaction and abort transaction have to be thrown away and should not be committed to the disk; that is, they should not be made persistent.
So the idea is that between begin transaction and end transaction, the application developer is modifying the in-memory versions of the persistent data structures, and committing them to persistent storage by calling end transaction, which says: commit my changes to the in-memory versions to the persistent versions on the disk. On the other hand, if the developer calls abort transaction after a begin transaction, then all the changes he or she made to persistent data structures should be thrown away by the RVM library and not persisted to the disk.

The set range call is the very first thing an app developer does inside a begin transaction/end transaction sequence. You can think of the code between begin transaction and end transaction as a critical section in the app developer's code base, and the first thing that happens within that critical section is a call to set range. What set range is saying is: even though an address region may be mapped to an external data segment, in this particular critical section, bounded by begin and end transaction, I am going to modify only a portion of that address range, and that portion is specified by the starting address and the size of the block I am going to modify. That's the purpose of the set range call. For this particular transaction, for which RVM returned a unique transaction ID when I made the begin call, I use that transaction ID to tell RVM that in this transaction I am going to modify only a block of memory starting at this address and bounded by this size. So it is only this portion that you are going to modify. We will see in a minute how all of this is used by the RVM library in its implementation, but from the developer's point of view, this is all the developer needs to know and use in order to write an application with persistent data structures. As simple as that. All the heavy lifting needed to accomplish the developer's intent for persistence, as expressed through these primitives, is taken care of by the RVM runtime, which we will see in a minute.
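As a rough summary of the interface described so far, here is a sketch of what the primitives might look like as C declarations. The names, argument types, and mode flags below are paraphrased from the lecture; the real LRVM library's signatures differ in detail.

```c
#include <stddef.h>

/* Illustrative declarations only; not the actual LRVM header. */

typedef int  tid_t;                      /* transaction identifier */
typedef enum { MODE_DEFAULT, MODE_NO_RESTORE, MODE_NO_FLUSH } rvm_mode_t;

/* Initialization: name the log segment this process will use. */
int   rvm_initialize(const char *log_segment_path);

/* Map/unmap a region of virtual address space to an external data
   segment on disk (one-to-one, non-overlapping regions). */
void *rvm_map(const char *data_segment_path, size_t nbytes);
void  rvm_unmap(void *region, size_t nbytes);

/* Transactions: bracket modifications to persistent data structures. */
tid_t rvm_begin_transaction(rvm_mode_t mode);          /* MODE_NO_RESTORE: no undo record */
void  rvm_set_range(tid_t tid, void *base, size_t nbytes);
void  rvm_end_transaction(tid_t tid, rvm_mode_t mode); /* MODE_NO_FLUSH: lazy disk flush  */
void  rvm_abort_transaction(tid_t tid);

/* Optional explicit log management, for applications that want it. */
void  rvm_flush(void);      /* force buffered redo records to the log on disk */
void  rvm_truncate(void);   /* apply the log to data segments, then discard it */
```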

RVM Primitives (cont)

As we said earlier, RVM has to be efficient, otherwise no one will use it. The runtime does not actually write the persistent data specified through the set range call directly to the external data segments. Instead, it writes the changes made to the block of addresses specified by set range as a redo log in the log segment that was named in the initialize call, and we know that the log segment is the trick we mentioned earlier to avoid random writes to the disk. The log segment is an in-memory data structure of the RVM runtime, and once a transaction commits, the log segment containing the changes made to the in-memory versions of the persistent data structures is committed to the disk. In other words, RVM writes the changes the developer makes to persistent data structures between begin and end transaction as redo logs in the log segment, and the redo log that has been written into the log segment is committed to the disk at the point of end transaction. On the other hand, if the transaction aborts, those redo logs do not have to be committed to the disk.

Now remember that the changes we make to the in-memory versions of those persistent data structures have to eventually be committed to the external data segments. That part is done lazily by the RVM runtime: it applies the redo logs that have been committed to the disk to the external data segments at opportune moments, and once it has applied those redo logs from the log segment to the external data segments, the redo logs can be thrown away. This part is what is called truncation. So there are two parts to managing the log. One, at the point of commit the log has to be forced to the disk, because you want to persist it. Two, once it is committed to the disk, it eventually has to be applied to the external data segments, and at that point you can truncate the log. The log is applied to the external data segments, and once those external data segments have been modified, you can throw away the redo logs.

Flushing the log segment to the disk, as well as truncating the log segment once it has been applied to the external data segments it represents, are done automatically by LRVM. So the application developer doesn't have to do anything other than initialize the log segment he wants to use and write the code that contains these critical sections bounded by begin transaction and end transaction or abort transaction. But RVM also provides flush and truncate as primitives, for flexibility in writing the application. In other words, the application can, if it chooses, explicitly manage the persistence of the log and the application of the redo logs to the data segments. Shortly we will mention some optimizations, specified through the commit mode to the RVM runtime by the application programmer, to enhance performance. One of those optimizations allows RVM to defer flushing to the disk at the point of commit: the commit mode tells RVM, do not flush the changes in the log segment at the commit point yet; I'll take care of it explicitly by calling flush.
So these additional primitives are there for the developer, if she so chooses, to explicitly control flushing of the logs to the disk and to manage truncation of the logs by explicitly applying them, when she wants, to the external data segments in order to conserve space. If you think about it, all these log segments occupy space on the disk: at the commit point you are flushing the redo records in the log segment to the disk, so you are filling the disk with a lot of redo records in addition to the external data segments. Truncation is the way the subsystem designer can explicitly say: go ahead, apply these redo logs to the external data segments, and truncate the log. As I said, these calls are there purely for flexibility, if the designer chooses to use them to explicitly control and reduce the log space. These additional primitives are available in RVM as a way for the developer to administer the design of a subsystem and performance-tune it. The main thing to take away is that the RVM design is simple, with a small set of primitives, easy for a developer to grok and use in the design of a subsystem.

Also, we've been using the word transaction, but the semantics of a transaction as used in RVM is much more restricted than the traditional use of the word in the database literature. In particular, the transaction as proposed and used in the RVM library is intended for recovery management. It does not need all the properties usually associated with traditional transactions, the so-called ACID properties: atomicity, consistency, isolation, and durability. For example, a transaction as defined in RVM does not allow for nested transactions, and it has no support for concurrency control, although these are things you would normally associate with transactions. The key design objective in RVM is to make it simple, performant, and easy to use. If there is a need for concurrency control, it is something the developer has to implement at a higher level in their system software. In this sense, it is useful to remember what I said earlier: the code between begin transaction and end transaction is like a critical section in the application code. All the developer is trying to signal to the RVM library through that critical section is that it is making changes to a portion of an address range, and that portion of the address range needs to be made persistent if the critical section bounded by the end transaction commits; on the other hand, if that critical section aborts, those changes should be thrown away. That's the intent of the set of primitives provided by RVM, and that's the intent of its transaction semantics.
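A minimal sketch of what that explicit log management might look like, using the illustrative rvm_* declarations from the earlier sketch (those names and modes are assumptions, not the library's exact API):

```c
/* Assumes the hypothetical rvm_* declarations sketched earlier. */

/* Hypothetical in-memory persistent structure mapped via rvm_map(). */
struct counters { int committed_ops; };

/* Batch several no-flush transactions, then force the accumulated redo
   records to disk and reclaim log space in one shot. */
void checkpoint_example(struct counters *c)
{
    for (int i = 0; i < 100; i++) {
        tid_t t = rvm_begin_transaction(MODE_DEFAULT);
        rvm_set_range(t, c, sizeof *c);           /* declare what will change */
        c->committed_ops = i;                     /* modify only in memory    */
        rvm_end_transaction(t, MODE_NO_FLUSH);    /* commit, but don't block
                                                     on the synchronous write */
    }
    rvm_flush();      /* one sequential write of all buffered redo records */
    rvm_truncate();   /* apply the log to the data segments, free log space */
}
```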

How the Server Uses the Primitives

Now let's look at how a developer might use these primitives in building a subsystem. The first part of the code is the initialization portion, where the developer maps chosen regions of the address space of the process to external data segments, and also specifies the log segment to be used by this particular code base. In the body of the code there are regions where the developer wants to manipulate persistent data structures, and for those portions of the code the developer says begin transaction and end transaction; within them, the first thing to do is a set range call to indicate the block of contiguous addresses to be modified in this critical section. Of course, this block of addresses should be contained in a range that has been mapped to an external data segment. Once that is done, the rest of the code is normal code that manipulates data structures. The developer may be writing to a data structure M1, which is really metadata that needs to be persisted; if so, it is important that the user ensure that M1 is contained in the range specified at the beginning. Similarly, if M2 is another persistent data structure being modified, it had better be contained in that range as well. When the developer is done with all the changes and wants to commit, he or she calls end transaction, and at that point all the changes made to these data structures are written as a redo log into the log segment.

Now, the first thing LRVM does inside this transaction code, when the set range call is executed, is to say: aha, this is the portion of the address range that the developer is going to modify within this critical section. And it is possible that this transaction, which is beginning now, may commit or abort. If it aborts, LRVM has to make sure that all the changes made to persistent data structures are thrown away at the point of abort. Therefore the first thing LRVM does is create what is called an undo record, which is simply a copy of the virtual address space starting at the base address given in set range, for the specified number of bytes; that is the portion of the address space the developer intends to modify within this critical section. So LRVM makes an undo record, an in-memory copy of the original version of that portion of the address space. This is a temporary record, and in fact LRVM will create this undo record only if it is needed by the transaction's semantics. In the begin transaction call there is a mode specifier with which the user can tell RVM whether this particular transaction is ever going to abort. In other words, if the developer is absolutely certain that the transaction is never going to abort, he can specify the no-restore mode for this transaction, which tells RVM: look, this transaction is never going to abort, therefore there is no need for you to create an undo record. That's the intent of that mode. Again, we want these primitives to be performant, and this is one of the ways the application developer can make sure that LRVM does not do unnecessary work.
In this case, if the transaction is never going to abort, there is no reason to create an undo record; that's the idea behind the no-restore mode in begin transaction. In any event, if the transaction eventually commits, at that point LRVM throws away the undo record. The undo record is meaningful only if the transaction aborts, because in that case LRVM restores the original version of this portion of the virtual address space by copying the undo record back into that space. During the body of the critical section, while the application is modifying these in-memory versions of persistent data structures, there is no action by LRVM: all the changes are happening directly to the virtual address space of that particular process, exactly where the in-memory copies of the persistent data structures live.
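Here is a rough sketch of what the runtime's side of set range could look like, with the undo record kept only when the no-restore mode was not requested. The structures and function names are invented for illustration; they are not LRVM's internal code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-transaction bookkeeping inside the runtime. */
typedef struct {
    void          *base;        /* start of the range named in set_range */
    size_t         nbytes;      /* length of that range                  */
    unsigned char *undo;        /* copy of the old contents, or NULL     */
    int            no_restore;  /* mode given at begin_transaction       */
} txn_state;

/* On set_range: remember the range, and snapshot the old bytes unless the
   caller promised (no-restore mode) that this transaction never aborts. */
void runtime_set_range(txn_state *t, void *base, size_t nbytes)
{
    t->base   = base;
    t->nbytes = nbytes;
    t->undo   = NULL;
    if (!t->no_restore) {
        t->undo = malloc(nbytes);
        if (t->undo)
            memcpy(t->undo, base, nbytes);   /* undo record = old contents */
    }
}

/* On abort: copy the old contents back, erasing the in-place changes. */
void runtime_abort(txn_state *t)
{
    if (t->undo) {
        memcpy(t->base, t->undo, t->nbytes);
        free(t->undo);
        t->undo = NULL;
    }
}
```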

How the Server Uses the Primitives (cont)

So finally, if the transaction commits by calling end transaction, at that point all the changes that have been made to the persistent data structures have to be written to the log segment that records the redo logs for this application. At this point, LRVM creates a redo log in memory of the changes made to the persistent data structures; that is, the region that has been modified is going to be written out as a redo log. Note that LRVM does not know where exactly, within the region specified by the base address and length, the individual data structures live. All it knows is that this is the portion of the address space being modified in the critical section, which is why it is so important as a developer to make sure that the data structures you are manipulating are within the region you have signalled to LRVM. LRVM simply assumes that a contiguous range of addresses, starting at the base address and running for a certain length, may have been modified in this critical section. So the log record it writes says: here is the start address, here is the number of bytes, and this is the new data that goes into that portion of the virtual address space. That is what LRVM creates as a redo log in memory.

Remember that the redo log is itself a data structure of LRVM in memory; do not confuse the redo log with the external data segments. The external data segments are the persistent versions of the in-memory data structures, whereas the redo log holds the changes made to the in-memory versions of the persistent data structures that have not yet been committed to the disk or applied to the external data segments. At the end of end transaction, the changes are available as a redo log entry in the log segment that was initialized by this application. The semantics of the transaction is that if it commits, all of these changes are now on persistent storage. So what the LRVM library has to do is not only create the redo log record, which is a data structure in memory as far as LRVM is concerned, but also flush the redo log to the disk at the point of commit. Only after that can we assume that all the changes made in this critical section have been persisted to the disk. This flush is synchronous, meaning that end transaction waits for the redo log to be flushed to the disk.

At that point it is on the disk. However, in order to allow a performant implementation of LRVM, there is a mode available in the end transaction call, and this mode is no-flush, meaning that at the point of end transaction you don't necessarily have to block the caller for the flush to complete. Transaction semantics would require that the process executing the commit not be allowed to go past this point until the synchronous I/O from the redo log to the disk has completed, but to make the design more performance-conscious, if you think that power failures are rare and the chance that your server is going to crash is not very high, then as a developer you can say: at the point of end transaction, I want the changes committed, but you don't have to block me. In other words, the no-flush mode says: yes, I want this committed to the disk, but don't block me from going further.
So with no-flush, the changes that have been recorded as a redo log record will be committed to the disk later. As a developer, if I say no-flush, the redo log is not going to be synchronously written to the disk, so I can go ahead; I might do another transaction that writes more log records, and in that way I can reduce the number of I/O operations needed to commit these log records to the disk. That is the opportunity I am exploiting by giving the no-flush mode in end transaction: the application both reduces the number of I/O operations and avoids blocking here waiting for the synchronous writes to the disk to complete.

Once the transaction is committed, meaning that LRVM has created the redo log for this particular transaction, the undo record is no longer needed, because the undo record, if you recall, was created only for the eventuality that the transaction might not commit; now that the transaction has committed, we can throw the undo record away. On the other hand, instead of end transaction, the transaction may actually abort. If it aborts, then what LRVM has to do is restore this portion of the virtual address space of the process from the undo record, so that we are back to the state before the transaction ever happened. In other words, we make whatever code the server executed as critical section code between begin transaction and abort transaction go away, and we restore the computation to its state before the begin transaction by copying the undo record back into the portion of the virtual address space that was modified by the critical section code. That, in a nutshell, is how you would use the primitives provided by LRVM in constructing a server that has certain persistence requirements.
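Putting the pieces together, a server's persistent update might look roughly like the following, again using the hypothetical rvm_* declarations sketched earlier rather than the library's real interface.

```c
/* Assumes the hypothetical rvm_* declarations sketched earlier. */

/* Hypothetical in-memory copy of persistent file-server metadata. */
struct inode_entry { int refcount; long length; };

void update_inode(struct inode_entry *inodes, int i, long new_length)
{
    tid_t t = rvm_begin_transaction(MODE_DEFAULT);  /* default: keep an undo record */

    /* Declare the contiguous block this transaction will modify. */
    rvm_set_range(t, &inodes[i], sizeof inodes[i]);

    inodes[i].length = new_length;                  /* normal in-memory updates     */
    inodes[i].refcount++;

    if (new_length < 0) {
        rvm_abort_transaction(t);   /* undo record restores the old contents */
        return;
    }
    rvm_end_transaction(t, MODE_DEFAULT);           /* redo record forced to disk   */
}
```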

Transaction Optimizations

I mentioned earlier that RVM provides opportunities for the developer to give hints to the library for optimizing its performance for the chosen application. In other words, the transaction semantics of RVM is already a stripped-down version of the traditional transaction semantics; it does not, for example, allow nested transactions. But a transaction by its very nature still requires that at the commit point you do a synchronous I/O to the disk. Similarly, a transaction by its very nature has an all-or-nothing property: if the transaction is not going to commit, that is, if it is going to abort, then we have to make sure that all the changes made between begin transaction and abort transaction are thrown away; and if the transaction commits, then at the commit point all the changes made to the in-memory copies of persistent data structures have to be committed to the disk. That is where the synchronous I/O comes in. But if the developer wants to optimize for performance, RVM gives opportunities for such optimizations.

The first opportunity is the no_restore mode in the begin_xact call. The no_restore mode in begin_xact signals to RVM that the transaction being started is not going to abort, and therefore there is no need to create an in-memory undo record: even though I am going to give you a set range call, don't bother creating an undo record for the range of addresses I intend to modify in this transaction. That is what the no_restore mode means. It reduces the amount of work RVM has to do, since it avoids a memory copy, and the application gains because the overhead of performing a transaction, as seen by the application, is lower.

The second optimization opportunity is the no-flush mode in end transaction. As I mentioned, a transaction has truly committed only when the changes made in the critical section between begin transaction and end transaction have been synchronously written out to the disk. So the normal semantics of end transaction, that is, of a transaction commit, would require RVM to block the process that made the end transaction call until the redo record has been written synchronously to the disk. But if the application developer is opportunistic and believes that the chance of failure, either due to power failure or due to his own software crashing, is pretty small, he can be brave enough to specify the no-flush mode. What the no-flush mode tells the RVM library is that there is no need to do a synchronous I/O: of course I want you to write it to the disk, but don't block me in order to do it. So there is no need for a synchronous flush of the redo log to the disk. In other words, what we get with the no-flush mode in end transaction is lazy persistence. We know the changes are going to be persisted to the disk, that is the work RVM is going to do, but it is not doing it exactly at the point of end transaction. The upshot is that there is a window of vulnerability: the end transaction happened here, and by the time RVM gets around to writing the redo log out to the disk, some time has elapsed. That time window is the window of vulnerability.
If this axis is time, then end transaction happened here, and this is the point where the redo record was committed to the disk, so this interval is what we are calling the window of vulnerability. The app developer is taking a chance, saying: I am so sure that no crash is going to happen in this window that I'm going to go ahead and say no-flush. In other words, the transaction is being used as insurance, and this should remind you of the old adage we saw when we talked about shared memory systems: shared memory systems scale really well when you don't share memory. Similarly, transactional systems perform really well when you don't use the full semantic requirements of transactions; in particular, if you can get rid of synchronous I/O, the performance gets better.


Remember, the goal is a performance-efficient implementation of RVM. The restricted semantics of transactions help a lot in making sure that what we are designing as a recoverable virtual memory is not very heavyweight. But the implementation of that reduced semantics has to be efficient as well, and that's part of the reason this is called lightweight recoverable virtual memory: to indicate that it is lightweight in terms of transaction properties. Now, how do we make it really perform well?

The first thing is the strategy used for recording changes to the persistent portion of the virtual memory. The logging strategy is what is called no-undo/redo value logging. No undo means that although we create an undo record of the portion of virtual memory we are going to change, it is not a log that is persistent on the disk; it is just an in-memory copy kept only for the duration of the transaction, and at the end of the transaction, whether it commits or aborts, we throw that undo copy away. The redo log, on the other hand, is the log you create, first of all as an in-memory data structure in RVM, and then commit to the disk; and in committing the redo log, we write only the new-value records of committed transactions to the log. So even though the redo log consists of a transaction start and the changes you are making, only the new-value records of committed transactions are written to the log, appended at the tail of the log segment. This is the reason you have forward displacements, so that we know where to append to the log segment on the disk; there is an in-memory version of the log segment too, into which you write the redo logs, and once you've written them, you flush them to the disk. On the disk you have an on-disk version of this redo log for this particular process, and what you are doing at the commit point is appending only the new-value records of the committed transactions to the tail of that log segment; so we append to the log segment only the new changes made within this transaction.

Upon commit, what we need to do is replace the old-value records in virtual memory with the new-value records. But this is automatic, because of the way LRVM works: it has created an undo record of the old values of that portion of the virtual address space, and all the changes the developer makes to the persistent data structures are happening in place in memory, so replacing the old values with the new values in virtual memory is automatic. Only if you abort do you have to undo the changes; if you are committing, your virtual memory is already up to date with the changes made within the transaction. At that point you have to force the redo log records to the log on the disk, and as I mentioned earlier, the optimizations available in the implementation let you get transactions on the cheap. In particular, the no-restore optimization allows the implementation not to create an in-memory undo record; that is time saved in copying, which means better performance for the application. Similarly, no-flush at the commit point tells RVM that it can write the redo log to the disk lazily; it does not have to block the process that is making the end transaction call. And that again is an opportunity to make the implementation more performance-conscious.
This lazy commitment of course has its downside: there is the window of vulnerability I mentioned, that is, a time window between end transaction and the point at which the redo log has been forced to the disk. If the system crashes within this window, then we have lost the redo records that we wrote only to memory. It is important to understand that this is the price you pay for the flexibility and performance of the implementation. You can also see that the redo log data structure allows traversal in both directions, and this is for flexibility in implementing the RVM runtime. In particular, when writing to the on-disk version of the redo log, the forward displacements let you know exactly where to append the changes made in this transaction, which are being committed at this point in time, to the existing redo log. Similarly, the reverse displacements are helpful in traversing the log record during recovery.
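The lecture describes the on-disk log record only at the level of a transaction header, a set of modified ranges with their new values, an end mark, and forward/reverse displacements. A guessed-at C layout of such a record might look like this; the field names and types are mine, not the paper's.

```c
#include <stddef.h>

/* One modified range within a committed transaction: where it goes in the
   mapped external data segment, how long it is, and the new bytes. */
struct range_record {
    size_t        offset_in_segment;   /* where in the external data segment */
    size_t        nbytes;
    unsigned char new_data[];          /* new-value bytes follow             */
};

/* Header for one committed transaction in the redo log. */
struct log_record {
    size_t forward_displacement;   /* bytes to the next record: lets the
                                      runtime find the tail to append to     */
    size_t reverse_displacement;   /* bytes back to the previous record: lets
                                      recovery walk backwards from the tail  */
    int    num_ranges;             /* how many range_records follow          */
    /* range_record entries follow, then an end mark */
};
```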

Crash Recovery

So if you look at the redo log, it has a transaction header, and between the transaction header and the end mark are all the changes that were made in that particular critical section by the developer: for each address range changed within the critical section, the new data that corresponds to the changes made to that range, the new data for this range, the new data for that range, and so on. That is the structure of the redo log record that has been created and forced to the disk. Now, when we resume from a crash, what we need to do is make sure that the external data segments are updated with all the changes that have been made and recorded in the redo log but not yet applied to the external data segments. Crash recovery is, of course, the whole reason for LRVM. What we do when we resume from a crash is read the redo log starting from the tail of the log segment, and that's where the reverse displacements come into play. Once we've read the log from the disk, we apply it to the external data segments where it has to go. All of that information is contained in the redo log record for the transaction: what address range was modified, and which external data segment that address range corresponds to. So you can take the record and apply it to the external data segment, and once you have done that, you can throw away the log. That's how crash recovery works.
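A deliberately simplified sketch of that recovery step is below. It models the log as an in-memory array of flat records rather than walking the on-disk log backwards via the reverse displacements, and all names are illustrative rather than LRVM's.

```c
#include <stddef.h>
#include <string.h>

/* One committed change: the offset it patches in a data segment and the
   new-value bytes to write there. */
struct redo_record {
    size_t        offset;        /* where in the data segment to apply */
    size_t        nbytes;        /* how many bytes of new data         */
    unsigned char new_data[64];  /* the committed new values           */
};

/* Apply every committed record, oldest to newest, to the data segment so
   that later transactions overwrite earlier ones; afterwards the log can
   be thrown away. */
void recover(unsigned char *data_segment, size_t seg_len,
             const struct redo_record *log, size_t num_records)
{
    for (size_t i = 0; i < num_records; i++) {
        const struct redo_record *r = &log[i];
        if (r->offset + r->nbytes <= seg_len)
            memcpy(data_segment + r->offset, r->new_data, r->nbytes);
    }
    /* After all records are applied, the on-disk log can be truncated. */
}
```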

Log Truncation

So far we've been pessimistic, assuming we may have crashes, but hopefully crashes are not something that happens very often, and if crashes don't happen that often, crash recovery is not that important. On the other hand, if crashes are not happening and the system is progressing along, then we are going to create lots of log records on the disk as the system makes forward progress. We have the external data segments, which of course we need because that is where the persistent objects are actually contained, but we are also creating these redo log records, which are reflections of the changes we made to the in-memory versions of those persistent data structures. Eventually the redo logs that represent changes to an external data segment have to be applied to that external data segment, and if the only time we do that is when a crash happens, that is being very pessimistic; we may also end up clogging the disk with redo log records. We saw the need for log truncation in distributed shared memory systems as well: in the case of DSM, those logs were clogging physical memory, whereas in the case of LRVM, these logs are clogging disk space. Either way, they are unnecessary overhead in terms of space and clutter, and also, if an application needs to map an external data segment, then we have to know whether that data segment is up to date or not, and that depends on whether or not there are redo logs still pending to be applied to it.

All of this suggests that what we need to do is truncate the log periodically. What exactly is truncating the log? It means reading the logs from the disk, applying them to the external data segments, and getting rid of them. Now, this sounds exactly like what I described happens when we recover from a crash; therefore, for log truncation, we simply apply the crash recovery algorithm. Any time the LRVM runtime decides it is time to do some cleanup, it picks some logs to clean, brings those logs into memory, reads the redo log records, applies them to the appropriate data segments, and throws away the log records. That's what log truncation is all about.

Of course, we don't want to stop the world in order to do this log truncation; we want to do it in parallel with forward processing by the application. The way LRVM allows that to happen is by splitting the log into epochs. It says: this portion of the log is what I've chosen to clean up, this is the truncation epoch, and this is the part I'm going to read from the disk and apply to the external data segments. In parallel with that, I allow the application to make changes: the current epoch is where the application is writing new log records, and there is a new epoch not yet in use. So the idea is that we allow RVM to do its work of log truncation by reading the truncation-epoch portion of the log and applying it to the external data segments, and in parallel we allow the application to make forward progress by writing new log records to the current epoch.
So the crash recovery algorithm is being applied to the part of the log that is in the truncation epoch, while forward processing continues on the part of the log, the current epoch, that the server is working on. The biggest challenge in implementing LRVM is the log truncation code, because of the amount of coordination needed between what the LRVM runtime has to do and what the application may be doing in terms of modifying the current log segment. You need the log for recovery, but it is pure overhead when there are no crashes: it takes up a lot of disk space and puts an extra burden on mapping a data segment into the subsystem that wants to use it. So managing the logs, and truncating them as efficiently as possible, is one of the hardest problems, according to the authors of this paper, in implementing LRVM, because it directly affects LRVM's performance. In fact, the bulk of the heavy lifting that goes into implementing the LRVM runtime and making it really lightweight and efficient goes into doing this log truncation efficiently. What I've described here is a coarse level of log truncation, where we take the on-disk redo logs and apply them. A finer-grained way of implementing log truncation would be to also look at the in-memory copy of the log segment and apply it to the external data segments directly, so that we don't even incur the cost of writing a disk version of the redo log. That is even more complicated, and I invite you to read the details in the paper.
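To illustrate the epoch idea, here is a sketch of how a runtime might carve the log into a truncation epoch and a current epoch so that truncation and new writes proceed on disjoint portions of the log. The struct and function names are illustrative, not LRVM's actual code, and the sketch reuses the simplified redo_record and recover routine from the earlier recovery sketch.

```c
#include <stddef.h>

/* Assumes struct redo_record and recover() from the earlier sketch. */

/* The log, viewed as an array of records, split by one boundary:
   [0, truncation_end)      -> truncation epoch: being applied to segments
   [truncation_end, count)  -> current epoch: the application appends here */
struct epoch_log {
    struct redo_record *records;
    size_t count;
    size_t truncation_end;
};

/* Choose everything written so far as the truncation epoch; new appends
   from the application land after this boundary, so the two don't clash. */
void start_truncation(struct epoch_log *log)
{
    log->truncation_end = log->count;
}

/* Background work: apply the truncation epoch with the same routine used
   for crash recovery, then slide the remaining (current-epoch) records down
   and discard the truncated portion. */
void finish_truncation(struct epoch_log *log,
                       unsigned char *data_segment, size_t seg_len)
{
    recover(data_segment, seg_len, log->records, log->truncation_end);
    for (size_t i = log->truncation_end; i < log->count; i++)
        log->records[i - log->truncation_end] = log->records[i];
    log->count -= log->truncation_end;
    log->truncation_end = 0;
}
```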


LRVM is a classic piece of systems research. You start by understanding the pain point for system developers; once you understand the pain point, it becomes easier to think about what solution you can bring to the table to solve it. Managing persistence for critical data structures is the pain point identified by the LRVM work. LRVM proposes using lightweight transactions, that is, transactions without all the heavyweight ACID properties usually associated with transactions in the database literature, and these lightweight transactions give the persistence semantics needed to design robust subsystems that can tolerate crashes.