What are the systems issues in managing large data centers? How do you program big data applications, such as search engines, to run on massively large clusters? How do you store and disseminate content on the web in a scalable manner? Hello and welcome back. These are the kinds of questions we will be addressing in this module of the course. This is what we mean when we use the term internet scale computing. A comforting factor for you is that you already know the answer to these questions: through the previous lessons, we've learned most of the techniques needed in constructing distributed services. What we're going to see in the upcoming lessons is a hardening of those distributed systems techniques to handle scale on the order of thousands of processors and the failures that come with it. It's not a question of if a failure is going to happen; it's a question of when.
Let's start off this lesson with a fun trivia question. Check off the ones below that you consider to be giant-scale services. I just want you to think about the services you use on an everyday basis that you think are giant-scale services: an online airline reservation system, purchasing, for instance, your next laptop online, the web mail all of you may be using, the searches you may be doing on the Internet, and, let's say this weekend you want to watch a movie, streaming it from Netflix.
As you may have guessed, the right answer is all of the above. Yes, these days, most of what you do on an everyday basis with a few clicks on your personal mobile device, laptop, or desktop is actually using some giant-scale service somewhere out there on the Internet, all the way from airline reservations to watching a movie online.
In this module, we will cover three topics. The first is the systems issues in providing giant-scale services on the internet. The second topic is the programming model used for the applications that deliver the services you expect on an everyday basis, the so-called big data applications. And the third topic is the content distribution networks used for disseminating content when you access giant-scale services on the internet. In this lesson, of course, we are focusing on the issues in giant-scale services from the perspective of the system.
Imagine you're going on the Internet and reaching a web portal, such as Gmail, let's say, for accessing your email. In fact, while you're doing this, a million other clients are doing exactly the same thing, trying to reach the web portal through the IP network. The service that you're trying to reach may be architected to direct your particular request to a specific site. And the architecture within that site may look like this: a whole bunch of servers, maybe on the order of thousands or even 10,000, within a cold, dark room, interconnected to one another through perhaps a high-bandwidth communication backplane. All of the servers are connected to data stores for processing the requests that come in, and the servers may optionally use the backplane to talk to one another in servicing any particular request. Any of the servers may be able to handle an incoming client request. Now, in the generic picture that I'm showing you here, the load manager sitting between these servers and the IP network redirects an incoming client's request to one of the servers. The chosen server may then process the query and answer the client. The role of the load manager is to balance the client traffic among all the servers; basically, the load manager has to make sure that no particular server is overloaded. One of the characteristics of most giant-scale services is that the client requests coming in are all independent of one another. Sometimes this is also called embarrassingly parallel, meaning that all of these client requests can be handled in parallel, so long as there is enough server capacity to meet all the incoming requests. So the role of the load manager is to balance the client traffic and redirect requests to the servers so that no particular server is overloaded and all of the servers are equally utilized in servicing incoming requests. The other responsibility of the load manager is to hide partial failures. When we're talking about the scale of giant-scale services, the kind of computational power that resides in the data center for servicing them is humongous, perhaps 10,000 nodes. And when you have so many compute nodes, data nodes, and so on, it's not a question of if something will fail, but a question of when something is going to fail. So the load manager has to observe the state of the servers and ensure that the incoming client requests are shielded from any such failures that may happen internally, within a particular site.
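To make the load manager's two responsibilities concrete, here is a minimal sketch, in Python, of a dispatcher that spreads independent client requests across servers while hiding failed nodes. The server names, the least-loaded selection policy, and the health-tracking scheme are all illustrative assumptions, not a description of any real load balancer.

```python
class LoadManager:
    """Toy dispatcher: spread independent requests over live servers,
    hiding servers that have failed (illustrative sketch only)."""

    def __init__(self, servers):
        self.load = {s: 0 for s in servers}    # outstanding requests per server
        self.alive = {s: True for s in servers}

    def mark_down(self, server):
        self.alive[server] = False             # partial failure hidden from clients

    def mark_up(self, server):
        self.alive[server] = True

    def route(self, request):
        live = [s for s, ok in self.alive.items() if ok]
        if not live:
            raise RuntimeError("no server capacity available")
        # pick the least-loaded live server so no single server is overloaded
        target = min(live, key=lambda s: self.load[s])
        self.load[target] += 1
        return target

lm = LoadManager([f"server-{i}" for i in range(4)])
lm.mark_down("server-2")                            # a failure clients never see
print(lm.route({"user": "alice", "op": "inbox"}))   # routed to one of the live servers
```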
Computational clusters are the workhorses of giant-scale services: any modern data center employs on the order of thousands and thousands of computational nodes connected by high-performance, high-speed networks. It is such computational clusters that serve as the workhorses for giant-scale services today. This figure, reproduced from Eric Brewer's paper circa 2000, shows the number of nodes in a data center on the order of 1,000; scale it up by ten or a hundred for the computational capacity you might find in a typical data center today, and for the volume of queries handled by giant-scale services today. Each node in the cluster may itself be an SMP. There are several virtues to structuring the computational resources inside a data center as a cluster of machines connected by high-performance backplanes. The advantages include absolute scalability: you can keep adding more resources without worrying about re-architecting the internals of the data center. Cost and performance is another important consideration: because every node in the cluster is identical to the others, as system administrators we can control both the cost and the performance of running a data center like that, and because of the independent components, you can easily mix and match generational changes in the hardware. The result is incremental scalability: you add more resources, you get more performance, because, as I told you earlier, most of the queries that come into giant-scale services tend to be embarrassingly parallel. So the more resources you've got, the more queries you can handle. There is incremental scalability when you're using clusters because you can add more nodes to the cluster, increasing the performance, or, if the volume of requests comes down, you can scale back. That's the advantage of using computational clusters as the workhorses for the data centers that are serving giant-scale services today.
As a refresher, I'm showing you the seven-layer OSI reference model of the protocol stack. Now, load management, as I mentioned, for dealing with client requests can be done at any of the levels of the seven-layer OSI model, from the network layer and up. And the higher the layer, the more functionality you can associate with the load manager in terms of how it deals with server failures and how it directs incoming client requests to the different servers. All of these are possible, the higher the layer at which you construct the load manager.
The load manager may be operating at the network level, that is, at the network layer of the seven-layer OSI reference model. In this case, the load manager is simply a round-robin domain name server. When client requests come in, they all carry the same domain name in order to reach this particular service. If you say gmail.com, then depending on how the service is architected, the Gmail request coming from a particular client may go to a particular site. And when it goes to that site, the domain name server, which is acting as a load manager, directs the incoming client requests to one of the servers by handing out different IP addresses. Even though the domain name the clients come in with is exactly the same, the round-robin DNS server assigns different IP addresses, corresponding to different servers, to the incoming client requests. That way the clients are redirected to different servers, and because the load manager is doing it at the level of individual server addresses, you get very good load balance. The inherent assumption in this model is that all the servers are identical, so an incoming request can be sent to any one of them, and perhaps also that the data is fully replicated, so that whichever server an incoming request goes to, it has access to all the data needed to satisfy that request. So the pro is good load balance, because the round-robin DNS scheduler can choose the least-loaded server to redirect an incoming client to. But the con is that it cannot hide down server nodes: because it is handing out the IP addresses of the servers to incoming requests, the load manager is not able to hide down server nodes from the external world. So now we move the load manager up in the OSI reference model, up to the transport level or even higher. The load manager could be layer-4 switches, that is, transport-level switches, or even higher-level switches. In fact, the layer-4 switches may be architected as switch pairs so that there is hot failover from a failed switch to its pair at the load-balancer level. Architecting the load balancer at the transport level or higher gives you the opportunity to dynamically isolate down server nodes from the external world, and it also makes it possible, with this architecture of the load manager, to have service-specific front-end nodes. For example, instead of dealing with arbitrary client requests, we're actually dealing with requests based on the kind of request coming in from a particular client. For instance, this could be a Gmail client, this could be a Google search client, this could be for a photo service like Picasa. Those are different functionalities that can be implemented at the front end of this load manager: look at the service, the particular program, that is actually triggering a client request coming into this particular site, and make decisions as to who should serve that request. In other words, we're building more intelligence into the load manager, so that it can know the semantics of the client-server communication that is going to happen based on the type of request that is coming in. And it is also possible, with this structure of having the load management at higher levels of the OSI reference model, to co-opt the client devices in the load management.
For example, it's possible that the load manager may even know the device characteristics that the client is using to come into the site. If it's a smartphone, the load manager might take actions commensurate with that device in terms of structuring the client-server interactions. Now, the data itself may be partitioned among all the servers. For example, let's say that you are doing a web search and that search came in as a client request. If it is redirected to a particular server, that server may not have access to all the data it needs in order to serve the search you came in with. In that case, the server may communicate over the backplane with other servers to get the data it needs to serve the incoming query, so that there is complete coverage of whatever the intended query needs as a result. Of course, if the data is partitioned and a server that is holding a portion of the data is down, then there will be data loss for the query. So for your specific search, the server that is handling it may not be able to give you a complete result, because a portion of the data is unavailable due to the partitioning of the data among the different servers at this site. Now, this is where replicating the servers for redundancy helps. Even though you may partition the data, you may also replicate the partitions, so that even if one server is down, another server that has a replica of the same partition may be able to provide the data that is needed to satisfy a particular query.
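As a rough sketch of the partition-plus-replica idea just described, the snippet below (Python, with made-up server names and a fabricated failure set) answers a query by touching every partition, using a surviving replica when one copy is down, and noting how much of the corpus was actually reachable:

```python
# Hypothetical layout: the corpus is split into partitions, each held by two replicas.
partitions = {
    0: ["serverA", "serverB"],
    1: ["serverC", "serverD"],
    2: ["serverE", "serverF"],
}
down = {"serverC", "serverD"}   # assume both replicas of partition 1 have failed

def answer_query(query):
    results, reachable = [], 0
    for pid, replicas in partitions.items():
        live = [r for r in replicas if r not in down]
        if live:
            results.append((pid, f"partial result from {live[0]}"))
            reachable += 1
        # else: this slice of the corpus is simply missing from the answer
    coverage = reachable / len(partitions)
    return results, coverage

_, coverage = answer_query("web search terms")
print(f"fraction of corpus covered = {coverage:.2f}")  # 0.67: complete only if every partition is reachable
```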
Thus far, I've given you a general idea of how an individual site that hosts computational resources and handles a whole bunch of incoming client requests may manage those requests in terms of load management. This brings us to the DQ principle. The server has all the data that is required for dealing with incoming client queries; we call this the full data set, Df, the data required for handling any incoming query to the server. Now clients come in with their queries, and we will call Q0 the offered load to the server, meaning the number of requests hitting the server per unit time. Even though the offered load is Q0, maybe the server can only manage to serve a portion of these incoming requests; the portion of the incoming requests that is completed by the server is Qc, the completed requests. We define the yield, Q, as the ratio Qc over Q0, that is, the ratio of completed requests to the offered load. So Q is going to be a fraction between zero and one. Ideally you want it to be one, so that all the client requests are serviced, but if the server is not able to deal with the offered load entirely, then the yield is going to be less than one. Now, each query that is coming into the server may require processing all of the data that's available in the server to answer that particular query. For instance, if there's a Google search and the search requires looking at the whole corpus of data that the server has to answer the query, then you want to look at the full data set Df. However, it may be that, because of failures of some of the data servers or because of the load on the server, it is only able to process the query using a portion of the total data set, Dv. Dv is the available data that is used in processing the query. In that case, the harvest, D, is defined as the ratio Dv over Df, that is, the ratio of the available data to the full corpus of data. Again, this is going to be a ratio between zero and one. Ideally, you want to make sure that the harvest is one, meaning that an incoming request is completely served with all the data that it wants. Depending on the service, I can give you different examples. For instance, if the incoming request is a search, you want to look at the whole corpus of data that is available to you to answer that search. But if you are not able to do that and the server is only using a portion of the full data, then the harvest, in terms of the quality of the results being given out to a particular client, is less than one.
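Putting the two definitions in one place, here is a tiny Python rendering of the arithmetic; the numbers themselves are illustrative, not measurements:

```python
def yield_fraction(completed, offered):
    """Yield Q = Qc / Q0: fraction of the offered load the server completed."""
    return completed / offered

def harvest(available_data, full_data):
    """Harvest D = Dv / Df: fraction of the full corpus behind each answer."""
    return available_data / full_data

print(yield_fraction(900, 1000))  # 0.9  -> 10% of clients were turned away
print(harvest(750, 1000))         # 0.75 -> answers drawn from 3/4 of the corpus
```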
The product DQ, that is, the data processed per query times the rate of queries coming into the server, represents some sort of system limit: for a given server capacity, the product DQ is a constant. Said differently, for a given system capacity, we can increase the number of clients we are serving if we reduce the amount of data we're using to process the incoming queries; that is, we're increasing Q by decreasing the harvest D. Or we can do the opposite: we can entertain a smaller set of queries, so the yield will come down, but we will be able to use the complete data needed for serving each query. An example would be, if I am a client accessing a server for my mail, I would want to get all my mail, not part of it. So in that case, I will be happy if the server gives me a harvest of one, even if it means that not all the incoming clients may be served by the server due to the capacity limitation of the server at any point of time. So as a system administrator, we have some choices to play with. The important thing is that in giant-scale services, the performance is limited by the network and not by the I/O capacity. We can have as much I/O capacity as we want by having more hardware resources, but what we are bound by is the network capacity. So if we increase the capacity of a server, we have a choice as a system administrator: we can either increase the harvest, keeping the yield the same, or we can increase the number of clients we are serving, that is, increase the yield, keeping D a constant. That's the knob that a system administrator can play with in terms of dealing with capacity increases at the server. Similarly, if I am a system administrator and some nodes in the server fail, then again we have a choice of either decreasing the harvest or decreasing the yield in order to deal with the reduced DQ value. Of course, it all depends on how the server itself is architected, whether the failures can be hidden through replication and so on. But the important message I want to convey to you is that this DQ represents a system constant, so far as the server is concerned, in terms of the capacity. Given a particular capacity of the server, DQ is fixed. So as a system administrator you have a choice of either sacrificing yield for harvest or harvest for yield; you cannot increase both the harvest and the yield without increasing the server capacity. A traditional measure that has been used in servers is the notion of IOPS, or I/O operations per second, and that is meaningful in database kinds of applications, where the applications are disk bound. But giant-scale applications are network bound, and for network-bound applications this metric DQ is much more intuitive as to what is going on in terms of managing the incoming load to the server and managing the corpus of data that the server has to look at in order to satisfy incoming queries. Another metric that system administrators have often used is what is called uptime. You may have heard of terminology like mean time between failures and mean time to repair. Mean time between failures is the expected time between two successive failures inside a data center, and mean time to repair is, once a failure is detected, how much time it takes to bring the failed server back up.
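As a back-of-the-envelope illustration of that trade-off, with entirely made-up numbers: if the network-bound capacity fixes the product DQ, then touching less data per query buys you more queries per second, and only added capacity raises both knobs at once.

```python
DQ = 10_000          # hypothetical constant: (MB scanned per query) x (queries/sec)

d_full = 100         # MB per query when the full corpus is consulted
print(DQ / d_full)   # 100 queries/sec at harvest = 1

d_half = 50          # consult only half the corpus per query (harvest = 0.5)
print(DQ / d_half)   # 200 queries/sec: yield up, harvest down

# Doubling the server capacity lets one knob rise without lowering the other:
print((2 * DQ) / d_full)  # 200 queries/sec, still at full harvest
```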
And the uptime is defined as the ratio of the difference between the mean time between failures, MTBF, and the mean time to repair, MTTR, to the MTBF; that is, uptime equals (MTBF - MTTR) / MTBF. You want the uptime to be close to one, and usually uptime is measured in nines, meaning 0.9999, five nines, six nines, and so on. So you want the uptime to be as close to one as possible for giant-scale services. But uptime as a metric is not very satisfying, because if no queries come into the server during the time that is needed for repairing, that is, during the MTTR, then you haven't made anybody unhappy. In that sense, uptime is not a very intuitive measure of how well a server is performing. It is better to look at the yield as the output metric, and the harvest, again, as the output metric, for how well a server is able to deal with the dynamism and scale of the requests handled by a particular giant-scale service. We'll see that this DQ principle is very powerful in advising the system administrator on how to architect the system: how much to replicate, how much to partition the data set that the server is handling, how to deal with failures, how to gracefully degrade the service when the volume of incoming traffic increases beyond the server capacity. All of these are questions that can be handled very elegantly using the DQ principle. And the key underlying assumption in the DQ principle is that giant-scale services are network bound, and not I/O bound.
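For reference, here is the uptime formula worked through with invented MTBF and MTTR figures, just to show how the "nines" fall out:

```python
def uptime(mtbf, mttr):
    """Uptime = (MTBF - MTTR) / MTBF, as a fraction of time the server is up."""
    return (mtbf - mttr) / mtbf

# Invented numbers: a node that fails every 10,000 hours and takes an hour
# to repair is up 99.99% of the time, i.e. "four nines".
print(uptime(10_000, 1))   # 0.9999
print(uptime(100_000, 1))  # 0.99999 -> five nines
```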
In architecting the data store and the computational resources that back the data store at a server, the system administrator has a choice of either replicating or partitioning the data. If you replicate the data, that means every data server has the full corpus of data needed to serve a particular request. Think of it as, say, Gmail: if a Gmail request gets sent to any one of these servers, it can handle the request that comes in, because it has the full corpus of data to deal with the incoming request. We mentioned that failures are inevitable in giant-scale services because of the number of computational elements that live within a data center. If failures occur, what happens when we have replicated data? What that means is that the harvest is going to be unchanged: even if some data repository has failed, because the data is replicated, we can redirect a client's request to another live server that has full access to the data repository, and therefore the harvest that you're going to get with replication is 100%; it is unaffected. On the other hand, because the server capacity has come down due to these failures, the yield is going to come down. In other words, with replication, the service is unavailable for some users for some amount of time, but all the users that are able to get the service get complete harvest in terms of the fidelity of the answer returned by the server. The other alternative for architecting the internal data repository of the server, of course, is to partition the data. So if you partition the data, let's say into n partitions of the full corpus, and some of the servers fail, then some portion of the corpus of data becomes unavailable. If this happens, the harvest is going to come down because some portion of the data corpus is unavailable, but it is possible to keep Q unchanged: so long as the computational capacity is unchanged, we can serve the same number of users. It's just that the fidelity of the result may not be as good as when all of the data corpus is available. For example, if this is a server that is serving search results and some portion of the data repository is unavailable, then each query that comes in will get serviced, but it is going to get a subset of the query results for the search it is doing. So the important message that I want to convey through these pictures is that DQ is independent of whether we are replicating or partitioning the data. Given a certain server capacity, DQ is constant, because we are assuming that disk space is cheap and therefore processing incoming requests is not disk bound but network bound. The only exception is if the queries result in significant write traffic to the disk. That is rare in giant-scale services, but if it happens, then replication may require more DQ than partitioning. As a first order of approximation, so long as we assume that giant-scale services are serving client requests that are network bound, DQ is independent of the strategy used by the system administrator for managing the data, whether it is replication or partitioning. Now, what this also means is that it gives an object lesson to the system administrator, saying that beyond a certain point, you probably want to both replicate and partition. Why?
Because failures are bound to happen, and when failures happen, if you have only partitioned the data, then a portion of the data corpus becomes unavailable. On the other hand, beyond a certain point, even if you have partitioned the data, if you also replicate it, then you're making sure that even if one replica is down, another replica of the same partition is available for serving a particular client request. And replication beyond a certain point is important because users would normally prefer complete data, or a complete harvest. Take, for instance, accessing your email through the Internet: you would rather put up with the service being unavailable for some amount of time than find that you only have access to a portion of your mailbox. That's the reason why replication beyond a certain point is better for dealing with giant-scale services. On the other hand, for searches, it may be okay to have a harvest which is not complete. So in structuring different services as giant-scale services, one has to make a decision on whether partitioning is the right approach, or at what point we may want to partition and then replicate as well. As a concrete example, the Inktomi search engine uses partial replication; Inktomi uses full replication for email, because of what I just said, namely that a user would expect complete harvest for a service like email. On the other hand, if it is a web cache that is being implemented as a giant-scale service, it is okay if it is not replicated.
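The following toy calculation, with invented numbers, restates the comparison: with one of n nodes down, full replication keeps harvest at one and gives up yield, pure partitioning keeps yield at one and gives up harvest, and the product the site can deliver is the same either way.

```python
n, failed = 10, 1                       # invented cluster size and failure count

replicated  = {"harvest": 1.0, "yield": (n - failed) / n}   # some clients turned away
partitioned = {"harvest": (n - failed) / n, "yield": 1.0}   # everyone served, partial answers

print(replicated["harvest"] * replicated["yield"])    # 0.9
print(partitioned["harvest"] * partitioned["yield"])  # 0.9 -> same DQ either way
```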
And the DQ principle is also very useful for managing graceful degradation of service. DQ defines the total system capacity, so if a server is saturated, meaning that we have reached the limit of the server in terms of DQ, which is a constant, then we have a choice of how to gracefully degrade the service from the point of view of the client. One possibility is that we keep the harvest the same, meaning that every client request that comes in gets complete fidelity in the answers returned by the server; D is fixed, and Q comes down because DQ is a constant. That's one option. The other option, of course, is to keep the volume of clients that are serviced, that is, the yield Q, constant, but decrease the harvest, so the fidelity of the results returned to the users is less than full. Because DQ is a constant, it allows us to gracefully degrade the service being provided by the server, depending on the choice you want to make between the fidelity of the results and the yield that you want to provide to the user community. In other words, the DQ principle gives us an explicit strategy for managing saturation. So as a system administrator, when we make these decisions on which to sacrifice and which to keep constant, we know how our decisions are going to affect the harvest, how they're going to affect the yield, how they're going to affect the uptime, and so on. The strategies that a service provider can use in structuring the servers, knowing that DQ is a constant, when the server is saturated, include cost-based admission control, you pay more, you get more, that may be one way to do it; or priority- or value-based admission control, that may be another way to deal with service saturation; or reducing data freshness, that is, keep serving the same set of requests but reduce the fidelity of the data. For instance, if I'm serving videos and I can serve the videos at different bit rates, then if the server is saturated, I might decide to serve all the users all the videos that they want, but at a lower bit rate. That is reducing the fidelity of the data being harvested. So that's the idea behind the DQ principle and how it might be used for graceful degradation of service when the server is saturated.
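Here is a small sketch of the two saturation knobs, assuming a fixed capacity for the product D times Q; the capacity, offered load, and per-query data figures are all made up:

```python
CAPACITY = 10_000      # sustainable (data per query) x (queries per second), invented
offered = 500          # queries per second arriving
per_query_data = 100   # data each query would ideally touch

# Option A: keep harvest = 1, admit fewer clients (cost- or priority-based admission).
admitted = CAPACITY // per_query_data
print(f"A: serve {admitted}/{offered} clients with full-fidelity answers")

# Option B: keep yield = 1, shrink the data (or fidelity) behind each answer.
harvest = CAPACITY / (offered * per_query_data)
print(f"B: serve everyone, but each answer uses {harvest:.0%} of the data")
```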
As a giant-scale service provider, another thing you have to worry about is online evolution and growth of the service, because services are evolving continuously. The Gmail that you're accessing for getting your email today may not be the same as what you accessed a month back. So, since the services are continuously evolving, the servers at the data centers have to be upgraded with a new version of the software, or maybe new nodes are being added, or maybe all the old nodes are being retired and brand-new nodes are being brought in to serve the user community. Whether it is a hardware or software upgrade that needs to be done, it has to be done with the understanding that while the upgrade is happening, there is going to be a loss of service. Again, the DQ principle comes in handy in measuring exactly what loss of service we're observing when we do this kind of online evolution and growth. The first choice is what is called fast reboot: bring down all the servers at once, upgrade them, and then turn them back on. This diagram is showing you time on the x axis and the DQ loss on the y axis. This is the time for the upgrade per node, the amount of time a node is going to be down in order to do the upgrade, whether it is a software or hardware upgrade, and this y-axis segment is the DQ loss per node. So if I bring down a node and upgrade it, it's going to be down for some amount of time, and this is the amount of server capacity, the DQ, that is lost as a result of that one server being down. If all the servers are down, which is what happens with fast reboot, then for this entire time, the area bounded by this green-shaded rectangle, we're going to have no service. The ideal is here, in terms of the total amount of DQ we want, and the axis is showing how much DQ is lost. What we are saying is, if we do this fast reboot, it is going to bring all the servers down for the amount of time it takes to upgrade them, and for that entire duration the service is unavailable. This idea of fast reboot is particularly useful in situations where we can exploit the diurnal property of the service. For instance, if you think about services that are being accessed all across the globe, we can use factors such as: this is nighttime for this user community, the users have probably gone night-night, and therefore they are not going to access the servers, so this is a good time to bring down the whole service and upgrade it. That might be a situation where fast reboot is very useful. We are assuming that the user community is segmented, so that users are directed to geographically different locations, and we can use the diurnal property to do this off-peak fast reboot of chosen replicas of the service. In this figure, the segment that I'm showing you here is the DQ loss per node, and the total DQ loss, of course, is n times the per-node DQ loss, where n is the number of servers undergoing this fast reboot. In other words, for u units of time, there is complete loss of DQ with fast reboot. An alternative to fast reboot is what is called a rolling upgrade. Here, rather than bringing all the servers down at the same time, we bring one server down at a time: upgrade one server, bring down the next server, upgrade it, and so on. This is what is called a rolling upgrade, or wave upgrade.
And in this case, the service is available throughout; there is no time at which the service is completely unavailable. But the upgrade lasts for a duration of n times u, because u is the upgrade time per node and the wave upgrade visits all n servers one after another. During that entire time, the service is available, but in every u units of time there is a DQ loss of this much, bounded by this area, throughout the entire upgrade process.
A third alternative is what is called a big flip. In the big flip, we bring down half the nodes at once, so the service is available at 50% of the full capacity. You can see that with a big flip, half the nodes are down, so we have reduced the server capacity, the total DQ available, by 50% for u units of time, and the total upgrade is going to last 2 times u, because we have partitioned the entire server capacity into two halves, upgrading one half and then the other. For 2 times u time units we're not going to have full capacity: during this 2u window we get the service at 50% capacity, and at the end of it we are back up to full capacity. So these are the three different choices that you've got. With fast reboot, the upgrade time for the entire service is just equal to the upgrade time for a single node, but the service is not available to any of the users during that period of time; this might be a good strategy for services that can exploit the diurnal property of the user community. The rolling upgrade is the more common one, where we upgrade the servers one at a time, but it's going to take a long time, especially when you're talking about data centers having thousands and thousands of processors; a rolling upgrade may take a long time, and it might be done in batches, for instance, instead of exactly one node at a time. And the extreme of the rolling upgrade is the big flip, where we bring down 50% of the nodes, upgrade them, turn them back on, and then do the upgrade for the remaining 50%. In this third case, the big flip, the service is always available, but at 50% capacity for a certain duration of time. But there is no way to hide the DQ loss, and you'll see that the DQ loss, which is shown by the area of the shaded rectangle in each one of these cases, is exactly the same. In other words, the DQ loss is the same for all three strategies, and it is the product of the DQ loss of an individual node, the upgrade time for an individual node, and the number of servers, n, that you have in your server farm. That's the total DQ loss for online evolution, and all that these different strategies are saying is that, as a system administrator, you have a choice in how you want to dish out the DQ loss and make it more or less apparent to the user community. With fast reboot, it's going to be apparent to the entire user community that an upgrade is going on. With the rolling upgrade, different segments of the user community are going to notice that the service has become unavailable for a short duration of time. And the big flip is sort of in between these two extremes, in that half the user community may see the service being unavailable for some amount of time. So, in other words, using the DQ principle, maintenance and upgrades are controlled failures that can be handled by the administrator of a system service. What we are seeing through this concept of DQ is a way by which a system developer and a system administrator can work together in figuring out how to architect the system, in terms of how the data should be partitioned or replicated, and how to fine-tune the service by looking at the instantaneous offered load and deciding whether to keep the yield the same or the harvest the same, knowing that DQ is a constant and the server is getting saturated.
And finally, when the service has to evolve, once again the system administrator can make an informed decision on how to do this online evolution by controlling the DQ loss experienced by the user community at any point of time.
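To close the loop on the three upgrade strategies, this sketch totals the DQ lost by each one using invented figures for the cluster size, per-node upgrade time, and per-node DQ; the point is only that all three lose the same total, spread differently over time and over users.

```python
n = 1000        # nodes in the service (invented)
u = 2.0         # hours to upgrade one node (invented)
dq_node = 1.0   # DQ contributed by one node (invented)

strategies = {
    # name: (duration of degraded service, fraction of capacity retained meanwhile)
    "fast reboot":     (u,     0.0),          # everything down for u hours
    "rolling upgrade": (n * u, (n - 1) / n),  # one node down at a time
    "big flip":        (2 * u, 0.5),          # half the nodes down, twice
}

for name, (duration, capacity) in strategies.items():
    lost = duration * (1 - capacity) * n * dq_node
    print(f"{name:15s} degraded for {duration:7.1f} h, DQ lost = {lost:.0f} node-hours")
# All three print 2000 node-hours (n * u * dq_node); only the shape of the loss differs.
```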
The key insight is that giant scale services are network bound and not disk I/O bound. This is what helps in defining the DQ Principle. The DQ Principle really helps the system designer in optimizing either for yield or harvest for a given system capacity. Also, it helps the system designer in coming up with explicit policies for graceful degradation of services when either servers fail or load saturation happens or upgrades are planned.