

System Architecture

Hello, and welcome. In this module, you will implement the components of a proxy server web cache and experiment with various cache policies. It's an opportunity to get practical experience with remote procedure calls, HTTP requests, and some powerful data structures. Pay close attention to the requirements in the instructions, and be sure to look at the documentation of the libraries that you will need to use. You'll be using rpcgen and cURL in some very simple ways; in some cases, only small modifications to the example code in the documentation are necessary. At the end of the project, you will have all the pieces needed to assemble a proxy server web cache. Good luck.

We'll start by describing the overall architecture of the system we're going to build. First, we have a set of local machines, and users on these machines want content from the servers of the world wide web. We could let them all request data individually, and this would work, but if our clients all want the same data, it could be rather inefficient to have that data travel all the way across the internet multiple times. So we introduce a proxy at our site that all our local machines connect to. This proxy contains a cache that stores part of the world wide web, hopefully the data that is going to be requested next by one of its clients. When a local machine makes a request, the proxy looks in its cache first, and if it finds what the client is looking for, it just returns the data. Otherwise, it makes the request to the web and then returns the data to the client.

These local machines will communicate with the proxy by remote procedure call, or RPC. So, to a developer writing an application for one of these machines, getting data from the web is going to feel just like making a procedure call. Pretty cool. The proxy server, on the other hand, actually has to use HTTP to get data from the internet. For this, you will use cURL, a very popular client-side URL transfer library that hides all of the messiness of sockets, the HTTP protocol, et cetera.

Getting these two communications to work will be the first two parts of the project. The third part will be implementing different replacement policies for the proxy server's cache. We're assuming the cache can't hold the whole web, so we have to prioritize which pages to hold, and we do this on the basis of the requests the proxy has seen from the clients so far. No replacement policy can be the best for all request patterns, and to illustrate this point, the last part of the assignment asks you to give a request pattern that makes one policy do substantially better than another.

Remote Procedure Calls

The first component we'll work on is the remote procedure call communication between the local machines and the proxy server. We need code to handle the communication on both the server side and the client side. Thankfully, there's a nice little tool called rpcgen that lets us abstract away many of the details and do little more than specify the signatures of the procedures we want to call remotely. We describe the interface for communication using the RPC language (RPCL) in a file, which in our application we'll call proxy_rpc.x. Then we run rpcgen, which creates three C language files for us: proxy_rpc_svc.c, proxy_rpc_clnt.c, and proxy_rpc.h. As you might have guessed, the server file ends up on the server side, the client file ends up on the client side, and the header file is used on both. These files handle the communication, but it's still up to us to write the procedure on the server side and to call it on the client side. For this part of the project, we'll provide the server-side implementation, which will just return a string that says "success". Your job is to write the client part that actually calls the remote procedure, along the lines of the sketch below, and you also need to describe the communication interface in the proxy_rpc.x file. The RPC programming guide, provided as a link, will be very useful.
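Here is a minimal sketch of what the client side might look like. The names HTTPGET_PROG, HTTPGET_VERS, and the generated stub httpget_1 are assumptions about what a simple proxy_rpc.x might declare, not the project's required names.

    /* Hypothetical client, assuming a proxy_rpc.x along the lines of:
     *
     *     program HTTPGET_PROG {
     *         version HTTPGET_VERS {
     *             string HTTPGET(string) = 1;
     *         } = 1;
     *     } = 0x31234567;    (the program number is an arbitrary choice)
     *
     * rpcgen then generates a client stub char **httpget_1(char **, CLIENT *).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include "proxy_rpc.h"

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <server_host> <url>\n", argv[0]);
            exit(1);
        }

        /* Open an RPC connection to the server over TCP. */
        CLIENT *clnt = clnt_create(argv[1], HTTPGET_PROG, HTTPGET_VERS, "tcp");
        if (clnt == NULL) {
            clnt_pcreateerror(argv[1]);
            exit(1);
        }

        /* Call the remote procedure; rpcgen stubs take a pointer to the argument. */
        char **result = httpget_1(&argv[2], clnt);
        if (result == NULL)
            clnt_perror(clnt, "call failed");
        else
            printf("%s\n", *result);

        clnt_destroy(clnt);
        return 0;
    }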

Using cURL

Now we turn to the problem of retrieving content from the web to our proxy server. Remember that on the server, we have the files proxy_rpc_svc.c and proxy_rpc.h, which handle the communication, and we need to implement the procedures that our remote clients are going to call. We do this in a file, proxy.c, which we'll link to the other files when we compile. The point of entry into our code will be the implementation of the HTTP GET method. Within this function, you should use the cURL library to retrieve the content from the web and return it to the client. A link to the documentation for cURL and a relevant example are provided in the notes. Good luck.
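As a starting point, here is a minimal sketch of fetching a URL into memory with libcurl, modeled on the library's well-known "getinmemory" example. The buffer type and variable names are illustrative; in the project, the downloaded bytes would be handed back through the RPC reply rather than printed.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>

    struct mem { char *data; size_t size; };

    /* libcurl calls this as data arrives; append each chunk to our buffer. */
    static size_t write_cb(void *contents, size_t size, size_t nmemb, void *userp)
    {
        size_t realsize = size * nmemb;
        struct mem *m = userp;
        char *p = realloc(m->data, m->size + realsize + 1);
        if (p == NULL)
            return 0;                /* a short count signals an error to libcurl */
        m->data = p;
        memcpy(m->data + m->size, contents, realsize);
        m->size += realsize;
        m->data[m->size] = '\0';
        return realsize;
    }

    int main(void)
    {
        struct mem m = { malloc(1), 0 };

        curl_global_init(CURL_GLOBAL_ALL);
        CURL *curl = curl_easy_init();
        if (curl != NULL) {
            curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&m);

            CURLcode res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                fprintf(stderr, "curl failed: %s\n", curl_easy_strerror(res));
            else
                printf("retrieved %zu bytes\n", m.size);

            curl_easy_cleanup(curl);
        }
        free(m.data);
        curl_global_cleanup();
        return 0;
    }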

Cache Implementation

Next, you are to implement the cache that lives on the proxy server. Your interface will look like the sketch below. The initialization method specifies the cache capacity, a minimum entry size, and a number of levels. The minimum entry size shouldn't prevent smaller entries from coming into the cache, and you shouldn't make smaller entries bigger. Instead, it is meant to give the cache a maximum number of entries that it could possibly store; that is, the maximum number of entries is the capacity divided by this minimum size. The last parameter is only important for the LRU-min cache replacement policy, which we'll discuss later. The cache is indexed by URLs, which are represented as strings. You don't have to canonicalize the URLs; that is, www.example.com is not the same as example.com for our purposes. Naturally, the set method takes in not only the URL, but also the data as a value, as well as its size. The get method takes in the URL as the key and returns the data via the return mechanism, and the size of the entry via an address passed in as a parameter. If the client doesn't care how big the entry is, it should be able to pass in NULL. Next, we have the memused procedure, which should just return the sum of the sizes of the entries currently in the cache, not how much memory your cache implementation actually uses. And last, we have the destroy method, which does the necessary memory cleanup.
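For concreteness, here is one plausible shape for that interface in C. The gtcache_* names and exact signatures are an assumption based on this description; the project's provided header is authoritative.

    #include <stddef.h>

    /* Set up a cache with a total capacity in bytes, a minimum entry size
     * (bounding the entry count at capacity / min_entry_size), and a number
     * of levels that only the LRU-min policy uses. */
    int gtcache_init(size_t capacity, size_t min_entry_size, int num_levels);

    /* Look up a URL. On a hit, return the data, and if size is not NULL,
     * store the entry's size there; on a miss, return NULL. */
    void *gtcache_get(char *key, size_t *size);

    /* Insert (or replace) an entry, evicting others as needed to make room. */
    int gtcache_set(char *key, void *value, size_t size);

    /* Sum of the sizes of the entries currently in the cache, not the total
     * memory footprint of the implementation. */
    size_t gtcache_memused(void);

    /* Free all memory associated with the cache. */
    void gtcache_destroy(void);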


Actually, for this project you're going to implement the GT cache interface not once, but four times: one for each of the four replacement policies that we're going to discuss. It is important that your implementation be efficient. This is not because it would particularly slow down the client's experience on a cache miss; the dominant factor there will probably be the latency of getting the data from the web. Rather, it's because we want to free up our proxy server to serve other requests and possibly do other work, and cache misses for web requests are going to be relatively frequent.

LFU Replacement

The first cache replacement policy we will discuss is LFU, or least frequently used. The idea is that the entry that has the fewest hits since it was put in the cache should be evicted. Implementing this efficiently actually turns out to be rather complicated. If we started with one of the other policies and then implemented each of them in the natural way, the four implementations would end up looking quite different. By starting with LFU, we can make them look similar and make your task of implementing them easier.

Because we're allowed to assume a minimum cache entry size, we have an upper bound on the number of entries. This lets us go ahead and allocate memory for all of the cache entries that we might need. Each of these will hold the URL, the web data, and any other information that is needed. Of course, we'll need a hash table that maps the URL to the appropriate entry; I'll do that with an integer index here, but you could also do it by memory address if you'd like. Then, to keep track of who should be evicted next, we need a priority queue. Notice that we aren't just inserting new elements and deleting the minimum. Some element in the middle, say Twitter, which has ID six here, might have its key value changed: maybe it gets five more hits, making it ten. Then it should sink down in the queue, away from the minimum, and swap places with ID zero, that is, Google. Notice that Twitter could have been anywhere in the priority queue, and we need a fast way to find it. This is why we assign all the entries IDs. We use an indexed priority queue that gives us constant-time access to any of the elements, and then we can have the element sink to the appropriate spot in the queue. This is actually the same data structure that one needs for Dijkstra's shortest path algorithm, only there the vertices swim up toward the min as we find shorter paths, rather than sink down as they do in our case. There is one remaining problem: how to assign new entries unique IDs. For this, we use a simple stack that starts out with the IDs 0 through n minus 1, and we pop and push IDs as they are assigned and freed up.

There are other ways to implement LFU replacement, and you'll get some credit for anything that is correct. But to get full credit, your implementation should be sublinear: no walking your way through a list of cache entries to find anything.
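To make the pieces concrete, here is a sketch of the bookkeeping described above; all field and type names are illustrative. The key property is that the indexed priority queue keeps a reverse map from entry ID to heap position, so that when an entry's hit count changes we can find it in constant time and restore the heap property in O(log n).

    #include <stddef.h>

    typedef struct {
        char  *url;      /* key */
        void  *data;     /* cached web content */
        size_t size;     /* size of the data in bytes */
        int    hits;     /* LFU priority: hit count since insertion */
    } cache_entry_t;

    typedef struct {
        int *heap;       /* heap[pos] = entry id; smallest hit count at the root */
        int *pos;        /* pos[id] = id's position in heap (the "index" part) */
        int  n;          /* number of ids currently in the queue */
    } indexed_pq_t;

    typedef struct {
        int *ids;        /* preallocated free-id stack, initially 0 .. n-1 */
        int  top;
    } id_stack_t;

    /* On a hit: increment entry->hits, then sink its id down the min-heap.
     * On eviction: pop the root (the minimum hit count), free that entry,
     * and push its id back onto the free-id stack. */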

LRU Replacement

Perhaps unsurprisingly, least frequently used replacement can easily be changed to least recently used with a few modifications. Instead of using the number of times an entry has been used as the key in our priority queue, we'll use the output of gettimeofday. Given that you've already implemented LFU, this is probably the easiest approach, and it is how I recommend that you complete the assignment. Doing it this way will also make implementing the LRU-min algorithm a little bit easier.

It is worth pointing out, however, that if we were implementing LRU from scratch, we wouldn't use this fancy indexed priority queue. No matter which element gets a hit, it always moves to the bottom of the queue, since it's now the most recently used; the potential log n checks that the indexed priority queue does are wasted, and we could replace the priority queue with a simple list. If we don't feel the need to preallocate space for all the cache entry data structures, we could also eliminate the need for IDs. Each value in the hash table could be a pointer to a separately allocated cache entry, and each of these cache entries would contain a pointer to its place in the eviction list. This is perhaps a more natural and elegant solution for LRU replacement, but for efficiency in completing this assignment, I recommend that you modify LFU as little as possible.
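A minimal sketch of the change, reusing the LFU machinery from before; names are illustrative. The priority queue is now keyed on a struct timeval, compared with timercmp, so the oldest entry sits at the root.

    #include <sys/time.h>

    typedef struct {
        char  *url;
        void  *data;
        size_t size;
        struct timeval last_used;   /* replaces the LFU hit counter */
    } cache_entry_t;

    /* timercmp(&a->last_used, &b->last_used, <) orders entries oldest-first. */

    static void touch(cache_entry_t *e)
    {
        /* Record the access time, then let e's id sink toward the bottom of
         * the min-heap, exactly as a hit does in the LFU implementation. */
        gettimeofday(&e->last_used, NULL);
    }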

Random Replacement

Random replacement can also be implemented using almost the same data structure. It is tempting to think that we can just assign each entry a random value and use that as its key in the priority queue, but then the choices of which element gets evicted are not independent of one another, which is not what we want. So we need to replace the priority queue with a random queue. For this, I've provided a data structure that supports fast deletion, in case you find it useful; I encourage you to use it. You could also implement this easily with a simple array, as in the sketch below; if you were implementing this policy from scratch, that's probably what you would do.
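Here is a sketch of the simple-array idea; the names are illustrative. Deleting by swapping with the last slot keeps eviction O(1), though if entries can also be removed by ID you would additionally need a position map, which is presumably what the provided fast-deletion structure gives you.

    #include <stdlib.h>

    typedef struct {
        int *ids;    /* ids of the entries currently in the cache */
        int  n;      /* how many slots are in use */
    } random_queue_t;

    /* Pick a victim uniformly at random and remove it in constant time. */
    static int rq_evict(random_queue_t *q)
    {
        int i = rand() % q->n;       /* uniform choice among current entries */
        int victim = q->ids[i];
        q->ids[i] = q->ids[--q->n];  /* swap-with-last deletion */
        return victim;
    }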

LRU-Min Replacement

Lastly, you are to implement a replacement policy called LRU-min. One of the weaknesses of LRU is that many smaller entries can be evicted to make room for a larger one. We can visualize this phenomenon with a 2D graph: we'll put the size of the cache entry on the y-axis and the last time used on the x-axis. Suppose that a large entry needs to be put into the cache, so we need to clear this much space, say S. If the least recently used items are much smaller than this, we would have to delete many of them. LRU-min addresses this problem by trying to remove the least recently used item that will make enough room by itself, that is, the least recently used item above an imaginary horizontal line at height S.

Now, in practice, letting the line be anywhere on the graph would be slow, so we discretize the cache entries into bands according to powers of two. Remember, this whole thing is just a heuristic. We maintain a priority queue for each band. If we need to clear S space out of the cache, we find the band that S belongs to and look for the least recently used cache entry that falls in a band above S's. If we find one, then we're done: we've created enough room in the cache. Suppose, however, that we don't find one. In that case, we try clearing from S's band, and if there's nothing there, we drop down a band, and so forth. And once we've found an element to delete, if we still don't have enough room, we recurse. Those are all the cache replacement policies. The comments in the partial source code provided should be helpful. Have fun implementing them.
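One way the banding might look in code; the exact scheme (band k covering sizes up to min_entry_size shifted left by k) is an assumption for illustration, with num_levels playing the role of the levels parameter from the cache's init call.

    #include <stddef.h>

    /* Map a size to its power-of-two band, clamped to the top level. */
    static int band_of(size_t size, size_t min_entry_size, int num_levels)
    {
        int band = 0;
        while (band < num_levels - 1 && size > (min_entry_size << band))
            band++;
        return band;
    }

    /* To free S bytes: search the per-band LRU queues above band_of(S) for a
     * single entry big enough; failing that, evict from band_of(S) downward,
     * recursing if the evicted entry still doesn't make enough room. */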

Data Selection

Unless it's possible to see into the future, no cache policy can be the best against every request pattern. To illustrate this point, and to help you understand the cache policies better, the next part of your assignment is to come up with request patterns that cause one policy to perform better than another across a range of cache sizes. A program that runs the performance test is provided to you, along with the rest of the project source code. To keep the evaluation simple, we've restricted you to around 100 URLs in a data file that we provide. Have fun with this one.
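As one example of the kind of reasoning involved: if the cache can hold N equally sized entries and the clients cycle through N + 1 distinct URLs in order, LRU evicts each entry just before it is requested again and misses every time, while random replacement still gets occasional hits. A toy generator for such a cyclic pattern might look like the sketch below (whether the test program accepts a plain list of URLs is an assumption, and the example.com URLs are placeholders):

    #include <stdio.h>

    int main(void)
    {
        const int nurls  = 101;   /* one more URL than an N = 100 entry cache holds */
        const int rounds = 50;

        /* Repeatedly request the same URLs in the same order. */
        for (int r = 0; r < rounds; r++)
            for (int i = 0; i < nurls; i++)
                printf("http://www.example.com/page%d\n", i);
        return 0;
    }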