## Course Introduction

Hello, and welcome to this review course, which is a prep for my graduate course on operating systems. When I teach graduate students on campus at Georgia Tech, I often find that there's a gap between the knowledge they have of operating systems at the start of the course and what they really should know in order to get the full benefit of my graduate class. This is not being judgmental. It is just that not all undergraduate operating systems courses are the same. And you may have taken a course on undergraduate operating systems a long time back. Or maybe you were not that enthusiastic about operating systems at the time that you took the course. This review course is not meant to be a substitute for an undergraduate education in systems. Rather, it covers the topics many students have said they wished they had mastered before starting to take my class. Helping me with this review course is Charles Brubaker, a PhD graduate from Georgia Tech, now turned course developer for Udacity. Hi there. With this course, I'll try to give you concise explanations that will quickly put you into a better position to fully understand the lectures and complete the projects of the advanced operating systems class. With these goals, let's get started.

## Introduction

The most important resources in a computer are space and time: space in memory for storage, and time on the central processing unit for computation. We may tend to think about these things separately, but there are important interactions between the two. And just as understanding the relationship between space and time is important for any good physicist, understanding the relationship between efficient use of memory and of the CPU is important for any good programmer. This interaction will be the subject of this lesson, as we look at cache systems and that all-important abstraction of virtual memory.

## Cache Motivation

To help us see the need for caches, we'll examine this little snippet of code. From an application programmer's perspective, a statement like this one, `sum += a[i]`, would seem to happen in one step. Even in assembly, it would only take a few instructions. In terms of time on the clock, however, not all of these instructions are the same. Sum, being a local variable, is most likely stored in a register in the CPU, so there is almost no cost to retrieving its value. `a[i]`, however, is in main memory, and retrieving its value can take 100 CPU cycles. And even then, the bus bandwidth might only be able to pass back two bytes a cycle or thereabouts. Trips to main memory are expensive, and this trend has only become worse over the past few decades of hardware advances. The solution to this problem has been to place a smaller set of memory, made with faster technology, closer to the CPU. We call this memory a cache. And here, we're looking at a latency of about ten cycles. So, when the CPU wants the contents of some memory address, it looks in the cache first. If it finds the data it needs, great, we've just achieved a factor of ten speedup. We call this a cache hit. If the data it wants is not in the cache, then we have to go to main memory. We call this a cache miss. The key is keeping these cache misses infrequent, so that on average, the time spent accessing memory is more like the time it takes to access the cache than the time to access main memory.
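The snippet the lecture refers to is just a running sum over an array. A minimal sketch in Python (the array contents here are made up for illustration):

```python
# A running sum like the lecture's `sum += a[i]`.
# `total` is the local variable that would live in a register;
# each a[i] is the memory load that may hit or miss in the cache.
a = list(range(1000))  # hypothetical array contents

total = 0
for i in range(len(a)):
    total += a[i]
```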

## Data Costs

To illustrate the power of the cache, as well as the importance of high hit rates, we'll do a few calculations in the form of a question. Suppose that retrieving data costs 4 cycles on a cache hit, but 100 cycles on a cache miss. How many cycles does a processor with a 99% hit rate spend on an average access? And then we want to compare that to a processor with only a 90% hit rate.

## Data Costs

For the first calculation, we have that 99% of the time it only costs four cycles, and 1% of the time it costs 100 cycles, for a total of 4.96 cycles on average. And for the second, we get 13.6. It may be a little counterintuitive that a relatively small change in the hit rate can produce such a dramatic change in speed. The key is to realize that it's really the miss rate that matters here, and we improve it by a factor of ten when we go from a 90% hit rate to a 99% one.
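The arithmetic above is the standard average-access-time weighting; a quick sketch:

```python
def avg_access_cycles(hit_rate, hit_cost=4, miss_cost=100):
    """Average cycles per access: weight hit and miss costs by their rates."""
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

print(avg_access_cycles(0.99))  # about 4.96
print(avg_access_cycles(0.90))  # about 13.6
```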

## Memory Hierarchy

Probably the most important consideration in cache design is the trade-off between speed and capacity. To help understand the impact of this trade-off, I'd like to visualize requests from the CPU to main memory like this, where the orange lines represent the requests from the CPU. We would like the cache hit rate to be high and the speed to be fast. These goals, however, are somewhat contradictory. We can have a small but fast cache and just accept that it will have a low hit rate, being able to intercept only a few of the requests from the CPU quickly. Or we could have a large cache, but this would be slower, either because of physical limitations like the speed of light, or because of the high cost of fast hardware. We'd intercept more requests from the CPU, but we wouldn't be able to answer them as quickly. What is the right trade-off? Well, it's hard to say. But fortunately, we don't have to make just one choice. We can trade off size versus speed multiple times at various levels within something called the cache hierarchy. What is not in the CPU registers, we look for in the L1 cache. What's not in the L1 cache, we look for in the L2 cache. On multicore processors, there might even be a third level of cache. What's not there, finally, we look for in main memory. Even main memory, however, we can think of as a cache for what is on disk, or for what's in the file system. Notice that the top of the hierarchy is smaller, faster, and costlier per byte, and as we go down the hierarchy, we get things that are bigger, slower, and cheaper per byte. If we do a good job of keeping the data that we're going to access next in the higher levels of the hierarchy, that is, if we can keep our hit rates high, then the whole system behaves like memory with the speed of the top of the pyramid but the capacity and overall cost per byte of the lower end of the pyramid.

## Fill in the Table

Okay, so let's do a quick exercise here. Suppose a 5 bit address space and the cache contents given in this table here. Fill in this table: for each address, indicate whether it's a hit or a miss in the cache, and if it's a hit, provide the value returned by the cache.

## Fill in the Table

Okay, here are the answers that I came up with; let's go over this together. As in our example, we have one bit for the offset, and then the next two are the index. So for this first address, I look at these two bits, see index 0 0, and then go over and look at the cache. I see there that the tag is 0 1, and oh, that does not match the 1 0 tag part of this address, so that's a miss. For the next one, again I begin by consulting the index. That's 1 1, so I go over here, and I find that the tag parts of my address and of the cache entry match. So that looks like a hit. Which of these two values is it? Well, for that I look at the offset, and I see that it's the second one. So I write the value 99. For the third one, again, the index is here, 1 0. I go and look at that in my table. I see I have a tag, 1 0, and oh, that matches the tag part of this address. So it's another hit. Which of these two values is it? Well, I go look at the offset. That's zero, so it's this first one, 57. And lastly, again we'll start by looking at the index. That's 1 1, and oops, the tags do not match, so that's a miss.
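The decoding steps in this walkthrough can be sketched as a small helper. The field widths (1 offset bit, 2 index bits, 2 tag bits) come from the 5-bit example; everything else is illustrative:

```python
def split_address(addr, offset_bits=1, index_bits=2):
    """Split an address into (tag, index, offset) fields,
    using the lecture's 5-bit example widths by default."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 0b10110 splits into tag 0b10, index 0b11, offset 0
print(split_address(0b10110))  # (2, 3, 0)
```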

## How Many Bits

Okay, here's another question. Suppose a 22 bit address space, a 512 byte cache, and a cache block size of 64 bytes. And I want you to tell me how many bits are used for the index, and how many bits are used for the tag.

## How Many Bits

Alright, here are my calculations. I saw that we had a 512 byte cache with 64 byte cache blocks. So, how many entries could there be? Well, we just do the division, and we find there are eight. So, that's three bits' worth of index. And then to calculate how many bits are used for the tag, I start out with all 22 bits. I remember that three are used for the index, and since we have 64 byte cache lines, six are used for the offset, and that leaves me with 13 left over for the tag.
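The same division-and-logarithm reasoning can be written out; a sketch assuming a direct mapped cache:

```python
import math

def cache_bit_split(addr_bits, cache_bytes, block_bytes):
    """Index, offset, and tag bit counts for a direct mapped cache."""
    index_bits = int(math.log2(cache_bytes // block_bytes))  # log2 of the entry count
    offset_bits = int(math.log2(block_bytes))
    tag_bits = addr_bits - index_bits - offset_bits
    return index_bits, offset_bits, tag_bits

print(cache_bit_split(22, 512, 64))  # (3, 6, 13)
```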

## Set Associative Mapping

It's possible to get unlucky using direct mapped caches, in that different blocks that are important to your application might get mapped to the same index in the cache. For instance, if an application happens to access two cache lines that have the same index, maybe this one and this one here, then they will constantly be evicting each other from the cache, and it will be essentially useless. This, of course, is rather silly given that there are all these other entries in the cache that aren't getting used at all. The fundamental problem is that an address, any one of these here, is only associated with one location in the cache. We can mitigate these effects by associating an address with more than one block in the cache. One strategy is to treat all the red lines as blue ones, and all of the orange lines as green ones. If we do this, then we'll have twice as many places in the cache to store a blue line of memory. Actually, it's more convenient to think about the blue lines in the cache being together. And in this way, we can say that we have simply decided to ignore the most significant bit of our previous index, but to allow a cache block to now be stored in two possible places. Hence, this would be called a two-way set associative cache. The downside is that we have to check two tags now to see if we have a cache hit. But we are less likely to run into the problem where we are constantly evicting the memory we need from the cache. If we decide this isn't good enough, we can halve the number of indices again to create a four-way set associative cache. Returning to our general notation, we now say that we have two to the k minus j indices into our cache, and the index is really only k minus j bits long now. This greater associativity also gives the cache some choice about replacement. Whereas with the direct mapped cache there was only one block in the cache where the data could be stored, we now have a set of possible locations, just two in our example but maybe more.
Typically, we choose to evict the entry that was least recently used.
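A toy model of a two-way set associative cache with LRU replacement might look like this. It tracks only tags and hit/miss counts, not data; the class and its interface are invented for illustration:

```python
from collections import OrderedDict

class TwoWayCache:
    """Toy 2-way set associative cache with per-set LRU replacement."""

    def __init__(self, num_sets, block_bytes):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        # Each set maps tag -> True; the oldest entry is the LRU one.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)   # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(ways) == 2:          # set full: evict the least recently used tag
            ways.popitem(last=False)
        ways[tag] = True
        return False

# Two blocks that share an index would thrash a direct mapped cache,
# but coexist here: two cold misses, then hits.
c = TwoWayCache(num_sets=4, block_bytes=16)
for addr in [0, 64, 0, 64]:
    c.access(addr)
print(c.hits, c.misses)  # 2 2
```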

## 2 Associative Question

Actually, let's make the next address a question. Is this address a hit or a miss?

## 2 Associative Question

The answer is that it's a miss. We find the index, we look in here, and oops, the tags don't match. It does match this tag up here, but that has the wrong index.

## Fully Associative Mapping

Now we follow the idea of associativity to its extreme and let any cache block be put anywhere in the cache. Returning to our example, if this was a direct mapped cache, where every block has only one place to go, and this was a set associative cache, where there was a small set of places a block could go, then this is a fully associative cache, where all cache blocks are treated alike and can be put anywhere in the cache. Looking at our more general notation, we see that full associativity means we don't interpret any of the address as an index anymore. It's all tag. This means that when we ask if a particular address is stored in the cache, the hardware has to check the tags for every cache entry, or at least do it sequentially really, really fast. This fact generally limits fully associative caches to sizes between 64 and 256 entries.

## Write Policy

So far we have focused mainly on reading data. But what happens when we want to write to memory? Here we'll consider a single core system; I'll refer you to an architecture class for shared memory systems, where keeping caches consistent becomes a challenge. We need to consider both what happens on a hit, when the block we want to write to is already in the cache, and what happens on a miss, when it isn't. On a hit, the most obvious thing to do is to write through to memory. That is, we'll write to the cache, and then also write the data to main memory, keeping the cache and main memory consistent. As long as I don't have to worry about some other entity needing an updated version from memory, however, there's actually no need to write everything this instant. This is the insight behind the write-back policy, which only writes to the cache. After all, that's where we'll look first if the same processor tries to read the data or write to it later. Only when the cache block gets evicted sometime later does the data really need to be written to main memory, and then it'll be the whole cache block. This lazy approach can be beneficial when the same cache block is written to many times before it is evicted. For cache misses we also have two strategies. The first is write-allocate, which first reads the current value in memory into the cache, then behaves as though it were a hit, using either of the two hit strategies. Lastly we have the no-write-allocate policy, where we just write the data to memory and don't bother with the cache at all. The interactions among these policies can be subtle, particularly when there are multiple cache levels. It is worth pointing out, however, that some of these policies share certain assumptions and are therefore often paired together. The write-back policy is betting that there will be subsequent writes to the same cache block.
That's why it bundles up all the changes it is making for one trip to memory, when the block gets evicted. Write-allocate is also betting on subsequent writes to the same block. That's why it loads the data up into the cache. On the other hand, write-through is betting on the cache block being evicted sooner rather than later. It does its writes to memory up front, so that it can quickly evict the cache block later if it has to, not having to worry about any pending writes to memory. Similarly, the no-write-allocate policy is also betting on that same cache block not being used again soon. It doesn't even bring the data into the cache.
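The bets these policies make show up in a toy count of how many writes actually reach main memory. This is a sketch under strong assumptions: a single block, every write hits in the cache, and exactly one write-back at eviction:

```python
def writes_to_memory(num_writes, policy):
    """Count memory writes for `num_writes` write hits to one cached block.
    'through': every write is propagated to memory immediately.
    'back': writes stay in the cache; one write-back when evicted."""
    if policy == "through":
        return num_writes
    if policy == "back":
        return 1 if num_writes > 0 else 0
    raise ValueError(policy)

print(writes_to_memory(10, "through"))  # 10
print(writes_to_memory(10, "back"))     # 1
```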

## Virtual Memory

The chief goal of a virtual memory system is to provide a process with the abstraction that it has an address space all to itself, typically organized like this. We have addresses for the program code. We have addresses for variables that are initialized: literals, constants, things like that. Space for uninitialized global variables that we expect to change over the course of the program. Space for the heap, which is dynamically allocated and might grow or shrink over the course of the program, and space for the user stack: the procedures' local variables, all that sort of thing. There are also addresses reserved for the kernel; we'll see how this improves efficiency later in the lesson. The advantages of this abstraction are best seen from the compiler's perspective. The compiler can choose an address for a local variable or for the code of the body of a procedure without having to worry about what other processes might be running on the computer at the same time as the application, or what physical memory those applications might be using. It gets to pretend that the application and the OS will have the computer entirely to themselves.

## Paging

Recall how for caches we divided memory into blocks, or cache lines; this is the granularity at which caching operates. Similarly, for a paging system we divide up our physical address space into pages, often 4 KB long. Pages are the granularity at which we grant memory to processes. The higher order bits of an address determine the page number, also called the physical page frame, and the lower order bits determine the offset into the page. Correspondingly, the virtual address space is similarly divided into pages. Again, the lower order bits determine the offset within a page, and the higher ones determine the virtual page number. The number of bits in the offset always matches; that's because virtual page sizes and physical page sizes are the same. But the number of bits in the virtual page number need not match the number of bits used to identify the physical frame number. That's because, as we said before, it's possible to have more virtual pages than physical pages. Since we're mapping page for page between virtual and physical addresses, the offsets stay the same. We do need to translate page numbers, however, and for this we use a data structure called the page table.
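Translation through a page table can be sketched like this. The mappings in `page_table` are made up; only the bit manipulation reflects the scheme described above:

```python
PAGE_SIZE = 4096      # 4 KB pages, as in the lecture
OFFSET_BITS = 12      # log2(4096)

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2, 7: 0}

def translate(vaddr):
    """Map a virtual address to a physical one via the toy page table."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise KeyError(f"page fault: no mapping for VPN {vpn}")
    pfn = page_table[vpn]
    return (pfn << OFFSET_BITS) | offset

print(hex(translate(0x1234)))  # 0x2234: VPN 1 maps to frame 2, offset kept
```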

## Virtual Page Numbers

Here's a quick question to make sure we're on the same page, so to speak. Consider a 16-bit virtual address space and a 512 byte page size. How many virtual page numbers are there? You can assume byte addressing.

## Virtual Page Numbers

Here's how I arrived at my answer. The fact that we have a 512 byte page size means that 9 bits are used for the page offset. This leaves us with 7 bits for the virtual page number, and that gives us 128 possible pages, since two to the seventh is equal to 128.
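The same computation, spelled out:

```python
ADDR_BITS = 16                              # 16-bit virtual address space
PAGE_BYTES = 512                            # 512 byte pages
offset_bits = PAGE_BYTES.bit_length() - 1   # log2(512) = 9
vpn_bits = ADDR_BITS - offset_bits          # 7 bits of virtual page number
num_pages = 2 ** vpn_bits
print(num_pages)  # 128
```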

## Virtually Indexed, Physically Tagged Caches

I can't resist talking about one final optimization. Looking at this load path, notice how the translation to the physical address and the cache look-up happen in sequence. We could speed up our implementation if we could find a way to do them in parallel. At first this might seem impossible. After all, the cache is indexed by the physical address, which we don't know until we've done the translation. Recall, however, that only the higher order bits change when we go from the virtual address to the physical address. The offset within the page stays exactly the same. Which bits are important for extracting data from the cache? Well, it's these middle bits that come after the offset for the cache block. As long as the index bits for the cache don't extend beyond the offset for the page, this little red line here, then we can extract data from the cache before we know the full physical address. Conceptually, we're going to do something like this, where the virtual page number goes to the TLB and the page offset goes to the cache. The tag, or tags in the set associative case, then gets compared with the page frame number from the TLB, and the correct data is returned.

## Number of Entries in a Virtually Indexed, Physically Tagged Cache

Here's a question to test your understanding of the virtually indexed, physically tagged cache. Suppose 32-bit virtual and physical address spaces. Pages are four kilobytes, and cache blocks are 64 bytes. What is the maximum number of entries, in a virtually indexed, physically tagged cache?

## Number of Entries in a Virtually Indexed, Physically Tagged Cache

Here's my answer. The 4 KB page size implies a 12 bit page offset. The 64 byte cache blocks use 6 of those bits for the block offset, which leaves at most 6 bits for the index. So the maximum number of entries is two to the sixth, or 64. The sizes of the virtual and physical address spaces don't matter at all.
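The calculation again, spelled out (assuming a direct mapped cache, so entries equal sets):

```python
PAGE_OFFSET_BITS = 12    # 4 KB pages
BLOCK_OFFSET_BITS = 6    # 64 byte cache blocks

# For a virtually indexed, physically tagged cache, the index must fit
# entirely within the page offset, so at most 12 - 6 = 6 index bits.
index_bits = PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS
max_entries = 2 ** index_bits
print(max_entries)  # 64
```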

## Conclusion

This completes our lesson on memory systems, and if you followed along, you now have an advanced understanding of how memory systems work. This should make you a better programmer and put you in a good position to think about some of the complex trade-offs involved in CPU scheduling and maintaining cache coherence, especially in the context of shared memory machines where there are multiple CPUs. There is always plenty more to learn.