This lesson is about games. Why games? Well, for one, games are fun. They've captured the imagination of people for thousands of years. They form a well-defined subset of the real world in that they have rules, which we understand and write down, and they are self-contained. They're not as messy as driving a car or flying an autonomous plane and having to worry about everything in the world. In that sense, they form a small-scale model of a specific problem. Namely, the problem of dealing with adversaries.
Along the way we've seen a lot of different technologies in this class and a lot of different techniques that focus on different parts of the agent-and-environment mix and the different difficulties there. Here we have a quiz, and what I want you to tell me is, for each of these technologies, what do they most address? Some of them address more than one, but give the best answer for each line. Do they address the problem of a stochastic environment--that is, one where the results of actions can vary? Do they address the problem of a partially observable environment--one where we can't see everything? Do they address the problem of an unknown environment--one where we don't even know what the various actions are and what they do? Do they address computational limitations--that is, problems of dealing with a very large problem rather than a small one and making approximations to deal with that? Or do they deal with handling adversaries who are working against our goals? And I want you to answer that for MDPs, Markov decision processes; POMDPs, partially observable Markov decision processes and belief space; for reinforcement learning; and for the A* algorithm, heuristic functions, and Monte Carlo techniques.
The answer is that MDPs are designed to do stochastic control. POMDPs are designed to deal with partial observability. Reinforcement learning deals with an unknown environment, and the heuristic function, A* search, and Monte Carlo techniques are used to deal with computational limitations. Monte Carlo techniques give us an approximation. The heuristic function, if we use the right one, still gives us the right answer but deals with the computational complexity. We don't as yet have any technology that's specifically designed to deal with adversaries.
What is a game? The philosopher Wittgenstein said that there is no single set of necessary and sufficient conditions that define all games. Rather, games have a set of features, and some games share some of them, and other games share others of them. It's a complex overlapping set rather than a single simple criterion. Here I've listed six different games--in some cases sets of similar games, like Chess and Go: Chess and Go, Robotic Soccer, Poker, hide-and-go-seek played in the real world, Card Solitaire, and Minesweeper, the computer solitaire game. I want to ask you, for each one, which of these properties they exhibit. Are they stochastic? Are they partially observable? Do they have an unknown environment? Are they adversarial? For each game tell me all that apply. Let me add that your answers may not be the same as mine, because these very terms are not that precise. Sometimes you can analyze a problem in two different ways and flip from one of these attributes to another, depending on how you analyze it.
Now, I've chosen to say that only robotic soccer and hide-and-go-seek are stochastic. By that I mean if you have an action like go forward 1 meter, the result of that action is stochastic. You may not go forward exactly 1 meter. You could also analyze games like poker and cards and say that they're stochastic in that the next card is random, and so the action of flipping over the next card is stochastic. You don't know how that action is going to turn out. I've chosen to model that as partial observability instead. What I've said is it's not that you pick the next card randomly; it's that the cards are already arranged in some order. It's just that you don't know what that order is. There's partial observability over the next card. Partial observability also shows up in the real-world sports of robotic soccer and hide-and-go-seek. Obviously, that's kind of the point of hide-and-go-seek--that it's partially observable. Now, in terms of unknown, I've said that only hide-and-go-seek satisfies that. In everything else, the world is well-defined. Even in a real-world environment like robotic soccer, you only have the known field to deal with, whereas in hide-and-go-seek, someone could be hiding anywhere in a room or location that you don't know about yet. Notice that many games are adversarial, but some games are not. Solitaire games are not adversarial. You could mark that down as saying, well, I'm playing against the game itself, but we don't count that as adversarial, because the game itself is not trying to defeat you. The game itself is passive. What adversarial has come to mean is that the opponent takes into account what you are thinking when doing their own thinking and tries to defeat you that way.
Here's a game that we've seen before. We call this a single-player deterministic game. We know how to solve this. We use the techniques of search through a state space--the problem-solving techniques. We draw a search tree through the state space, and I'm going to draw the nodes like this with triangles rather than with circles. In any position--in this position here--there are three moves I can make. I can slide this tile, this tile, or this tile. So I have 3 moves, and that gives me 3 more states. I keep on expanding out the states, going farther and farther down, until I reach one that's a goal state, and then I have a path through there that gets me to a solution. What does it take to describe a game? Well, we have a set of states S, including a distinguished start state S0. We have a set of players P, which can be just one player, as in this game, or two or more. We have a function that gives us the allowable actions in a state--sometimes we put in a second argument, which is the player making an action in that state, and sometimes it's explicit in the state itself whose turn it is to move. We have a transition function that tells us the result of applying an action in some state, giving us a new state. And we have a terminal test to say is it the end of the game. That's going to be true or false. Finally, we have terminal utilities, saying that for a given state and a given player there is some number which is the value of the game to that player. In simple games that number is a win or a loss, a one or a zero. Sometimes it's denoted as a +1 and a -1. In other games there can be more complicated utilities of you win twice as much or four times as much or whatever.
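As a sketch, the formal components just listed can be written down as an interface. This is my own Python rendering--the method names and the use of a typing Protocol are assumptions, not from the lesson:

```python
from typing import Hashable, Iterable, Protocol

class Game(Protocol):
    """The pieces that describe a game, mirroring the lesson: states S
    with a start state S0, players P, an actions function, a transition
    function, a terminal test, and terminal utilities."""
    start: Hashable                                # S0, the start state
    players: tuple                                 # P, one or more players

    def actions(self, s, player) -> Iterable: ...  # allowable moves in s
    def result(self, s, a) -> Hashable: ...        # transition: s, a -> s'
    def is_terminal(self, s) -> bool: ...          # is the game over?
    def utility(self, s, player) -> float: ...     # value of s to that player
```

Any concrete game--the sliding-tile puzzle here, chess later--would fill in these pieces.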
Now let's consider games like chess and checkers, which we define as deterministic, two-player, zero-sum games. The deterministic part is clear. The rules of chess say you make a move, take a piece, and that's it. There's no stochasticity. It's two players, one against another, and zero sum means that the sum of the utilities to the two players is zero. If one player gets a +1 for winning the game, the other player gets a -1 for losing. How do we deal with these types of games? Well, we use a similar type of approach. We have a state-space search. We have a starting state. There are some moves available to player one. Then in the next state there are moves available to player two. We're going to draw them like this, and we're going to give names to our players. The first player we're going to call Max, because it's a nice name, and because player one is trying to maximize the utility to player one. The next player, who operates at this level, we draw with a downward-pointing triangle. We call that player Min, because Min is trying to minimize the utility to Max, which is the same thing as trying to maximize the utility to himself or herself. Then we have a game tree that continues like that, alternating between Max and Min moves. Now, the search tree keeps going and let's say we get to a point where one player, and let's say it's Max, has a choice, and there are two states, and these, rather than being states where it's Min's turn, are states that are terminal. We'll draw them with a square box. Let's say one of them results in +1, a win for Max, and one of them results in -1, a loss for Max. Now if Max is rational, of course, Max is going to make this choice to the +1. What we're going to do now is show we can determine the value of any state in the tree, including the start state up here in terms of the values of the terminal nodes. The tree keeps on going. We assume it's a finite game. After a finite number of moves, every path leads to a terminal state. 
Then we look at each state and say whose turn is it to make the decision. In this state Max is making the decision, and Max, being rational, will choose the maximum value, saying, "I'd rather have a +1 than a -1, so I'll get a +1 here." We start going back up the tree, and maybe we get up to a point here where Min has a choice, and we've used this type of process to go up the tree, and Min has a choice between a +1 and a -1. Min is going to choose the minimum and will have a -1 here. If we go through all the possibilities, let's say these all result in -1, but this move results in a +1. Then Max will take that move. He'll say, "Out of my four possibilities, I know this is the best one. I'll take that move." Now we've done two things. One, we've assigned a value to every state in the search tree, and secondly, we backed that all the way up the top. Now we've worked out a path through that state to say, if all players are rational, here's the choices they would make. The important point here is that we've taken the utility function, which is defined only on terminal states. Here's a state here. The utility of that state was +1. Here's a state here. The utility of that state was -1. We've used those utility values in the definition of available actions to back those utilities up and tell us the utility of every state, including the start state.
Now let's define a function value of S, which tells us how to compute the value for a given state, and therefore will allow us to make the best possible move. If S is a terminal state, then the value is just the utility of the state given by the definition of the game. If S is a maximizing state, then we'll return something called max value of S, and if S is a minimizing state, then we'll return min value of S. Now we can define max value to just iterate over all the successors and figure out the values of each of those. We'll initialize a value m to minus infinity, and then for each pair of an action and a successor state in successors of S--let's call the successor S-prime so we don't get confused--we'll set m to the maximum of m, which keeps track of the best value so far, and the value of S-prime. Then when we're all done, we return m, the maximum value. This will compute the maximum at a maximizing node over all the states that we can reach with all the possible moves. The definition for min value is roughly equivalent but just reversed, taking the minimum instead. With these three recursive routines--value, max value, and min value--we can determine the value of any node in the tree. Now to do that efficiently, you'd want a little bit of bookkeeping so you aren't recomputing the same thing over and over again, but conceptually, this will answer any two-player, deterministic, finite game.
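Those three routines can be sketched compactly in Python. This is a minimal sketch under an assumed toy encoding--a state is either a number (a terminal utility for Max) or a (player, children) pair--rather than the general game interface:

```python
def minimax_value(state):
    """Return the minimax value of a state. Terminal states are plain
    numbers; interior states are ('max'|'min', children) pairs."""
    if isinstance(state, (int, float)):   # terminal test: return the utility
        return state
    player, children = state
    values = [minimax_value(child) for child in children]
    # This folds value, max-value, and min-value into one function:
    # Max takes the largest successor value, Min takes the smallest.
    return max(values) if player == 'max' else min(values)

# Max moves, then Min, then the game ends at a numeric utility.
tree = ('max', [('min', [3, 5]), ('min', [4, 1])])
print(minimax_value(tree))  # -> 3: Min holds Max to 3 or 1, and Max picks 3
```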
Now we know we have an algorithm that can solve any game tree, that can propagate the terminal values back up to the top and tell us the value for any position. It's theoretically complete, but now we need to know the complexity of the algorithm to figure out if it's practical. Let's look at an analysis of how long it's going to take. Let's say that the average branching factor--the number of possible moves or actions coming out of a position--is b. Here b would be 4. And let's say that the depth of the tree is m, so b wide and m deep. Now what I want you to tell me is what would be the computational complexity of searching through all the paths and backing the values up to the top. Would it be on the order of b times m, on the order of b to the mth power, or on the order of m to the b power? Choose one of these.
The answer is we have b choices at the top level, and for each of those b we have another b at the next level. That would be b squared, then b cubed, and so on, all the way to b to the mth power.
Now the next thing I want you to tell me is the space complexity. That was the time complexity. The space complexity is how much storage do we need to be able to search this tree. Remember that the value and max value and min value routines that we have defined are doing a depth-first search. Which of these would correctly represent the amount of storage that we would need-- the space complexity?
The answer is that we only need b times m space in order to do the search. Even though the entire tree is order b to the mth power in nodes, we only need to look at one path at a time in order to do the depth-first search. We generate these b nodes, store them away, look at the first one, generate b more, store those away, and so we're saving only b nodes at each level, for m levels, for a total of b times m storage space required.
The next question is, let's look at the game of chess, for which the branching factor is somewhere around 30. It varies from move to move. The length of a game is somewhere around 40. Certainly some games are much longer, but that's an average length of a game. Now let's imagine that you have a computer system, and you want to search through this whole tree for chess, and let's assume that you can evaluate a billion nodes a second on one computer. Let's also say that for the moment somebody lent you every computer in the world. If you have all the computers and they can each do a billion evaluations a second, how long would it take you to search through this whole tree? Would it be on the order of seconds, minutes, days, years, or lifetimes of the universe? Tell me which of these.
The answer is that it would take many lifetimes of the universe. Even though you have a lot of computing power at your disposal, there is no chance of searching through the entire tree for chess.
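A back-of-the-envelope check of that claim, with my own rough assumptions for the number of computers in the world and the age of the universe:

```python
b, m = 30, 40                     # chess: branching factor and game length
nodes = b ** m                    # nodes in the full game tree, about 10^59
evals_per_sec = 10 ** 9           # one computer: a billion nodes a second
computers = 10 ** 10              # a generous guess at "every computer"
seconds = nodes / (evals_per_sec * computers)
universe_age = 4.3e17             # seconds since the Big Bang, roughly
print(f"{nodes:.0e} nodes, about {seconds / universe_age:.0e} universe lifetimes")
```

Even after giving away ten orders of magnitude to parallel hardware, the answer is astronomically many lifetimes.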
Now our question is how do we deal with the complexity of having a tree with branching factor b and depth m. Here are some possibilities, and I want you to tell me which of these are good approaches. We have the problem of dealing with b to the m. Could we reduce b somehow, that is, reduce the branching factor, reduce m, the depth of the tree, or convert the tree into a graph in some way? Tell me which, if any or all, of these would be good approaches to dealing with the complexity.
Let's review just for a second. This is called the minimax routine for evaluating a game tree. Given a particular state we look and see: is it a terminal state? Is it a maximizing state? Is it a minimizing state? In each case we look up the utility from the game, we do the max value routine, or we do the min value routine, which is similar. That gives us the value of each state. Then the action that the agent would take would be just to take the action that results in the maximum state--the state with the best value. Now let's try to apply the minimax routine to this game tree. This is a small game in which Max has three options for his moves, and then Min has three options for its moves, and then the game is over. Here are the terminal values for these states in terms of Max's score. What I want you to do is use minimax to fill in the values of these intermediate states. What are the values of these three states for Min to move, and what is the value of this state for Max to move?
The answer is that these are minimizing nodes. The minimum of 3, 12, and 8, is 3. Here the minimum is 2. Here it's 1. Then this is a maximizing move. The max is 3. That means that if both players played rationally, then Max would take this move. Then Min would take this move, and the value of the game would be 3.
Now I want to get at the idea of reducing b, the branching factor. How is it that we can cut down on the number of nodes that we expand in the horizontal direction while still getting the right answer for the evaluation of the tree? Let's go back and consider that during our evaluation, if we get to this point, we've expanded these three nodes, we've figured out that the value of this one is 3, we've looked at this one so far and found its value was 2, and now, without looking at these, what can we say about the value of this node? Well, it's a minimizing node, so the most it could be is 2. If these are less than 2, it'll be less than that, and if these are more, it'll end up being 2. We can say that the value of this node is less than or equal to 2. Now if we look at it from Max's point of view, Max will have the choice here of choosing either this, this, or this, and if this one is 3 and this one is less than or equal to 2, then we know Max will always choose this one. What that tells us is that it doesn't matter what the values are of this node and this node. No matter what those values are, this is still going to be less than or equal to 2, and it's not going to matter to the total evaluation, because we're going to go this way anyway. We can prune the tree, chop off these nodes here, and never have to evaluate them. Now, in this particular case, that doesn't save us very much, because these are terminal nodes, but these could have been large branches--big parts of the tree--and we still wouldn't have to look at them. We've made a potentially large pruning without affecting the value. We still get the exact correct value for the tree.
Now I want you to tell me over here which, if any or all, of the three nodes can be pruned away by this procedure.
The answer is when we see the 14, we're not sure what this value is. It has to be less than or equal to 14, which means it might be on the right path or it might not. Once we see the 1, we know that the value is less than or equal to 1, and we know that we have a better alternative here, so we can stop at that point. Then we can prune off the 8. Out of the three, only this node, the right-most, would be the one pruned away.
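This pruning procedure is alpha-beta, and it can be run on the quiz tree. The tuple encoding and the middle branch's unexamined leaves (4 and 6) are my assumptions; note that with this left-to-right ordering the procedure also prunes those two, since the 2 already makes that branch worthless to Max:

```python
def alphabeta(state, alpha=float('-inf'), beta=float('inf'), seen=None):
    """state: a number (terminal) or ('max'|'min', children).
    `seen`, if given, records which leaves actually get evaluated."""
    if isinstance(state, (int, float)):
        if seen is not None:
            seen.append(state)
        return state
    player, children = state
    if player == 'max':
        v = float('-inf')
        for child in children:
            v = max(v, alphabeta(child, alpha, beta, seen))
            if v >= beta:
                break                 # prune: Min would never let us get here
            alpha = max(alpha, v)
    else:
        v = float('inf')
        for child in children:
            v = min(v, alphabeta(child, alpha, beta, seen))
            if v <= alpha:
                break                 # prune: Max already has something better
            beta = min(beta, v)
    return v

tree = ('max', [('min', [3, 12, 8]),
                ('min', [2, 4, 6]),
                ('min', [14, 1, 8])])
seen = []
print(alphabeta(tree, seen=seen))     # still the exact minimax value, 3
print(seen)                           # the final 8 is never evaluated
```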
Now I'm going to look at the issue of reducing m, the depth of the tree. Here I've drawn a game tree and left out some bits, but the idea is that it keeps on going and going. There'll be too many nodes for us to evaluate them all. What can we do? The simplest approach is to just cut off the search by fiat at a certain depth. We'll say we're only going to search to level three, and when we get down to level three, we're going to pretend that these are all terminal nodes. We'll draw them as the square boxes for terminals rather than the max nodes and cut off the search at that point. Now, of course, they aren't terminal, so according to the rules of the game, we haven't either won or lost at this particular point. We can't say for sure what the value is for each of these nodes, but we can estimate it using something called an evaluation function, which is given a state S and returns an estimate of the final value for that state. What do we want out of our evaluation function, and how do we get it? We want the evaluation function to be higher for positions that are stronger and lower for positions that are weaker. We can get it one way from experience--from playing the game before, seeing similar situations, and figuring out what their values are. We can try to break that down into components by using experience with the game. For example, in the game of chess it is traditional to say that a pawn is worth 1 point, a knight 3 points, a bishop 3 points, a rook 5, and a queen 9. You could add up all those points. So we could have an evaluation function of S which is equal to a weighted sum--the various weights times the various pieces--with positive weights for your pieces and negative weights for the opponent's pieces. We've seen this idea before when we did machine learning, where we have a set of features, which could be the pieces, and they could be other features of the game as well.
For example, in chess it's good to control the center, it's good not to have a doubled pawn, and so on. We could make up as many features as we can think of to represent each individual state and then use machine learning from examples to figure out what the weights should be. Then we have an evaluation function. We apply the evaluation function to each state at the cutoff point rather than doing a long search. Then we have an estimate, and we back those values up just as if they were terminal values.
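The weighted-sum idea can be sketched directly. This is an illustrative material-only evaluator using the traditional piece values; a real program would add many more features and learn the weights:

```python
# Traditional chess material values, as weights in a linear evaluation.
PIECE_VALUES = {'pawn': 1, 'knight': 3, 'bishop': 3, 'rook': 5, 'queen': 9}

def evaluate(my_pieces, opponent_pieces):
    """Weighted sum of features: positive weights for our piece counts,
    negative weights for the opponent's. Each argument is a dict
    mapping piece name -> count."""
    score = 0
    for piece, weight in PIECE_VALUES.items():
        score += weight * my_pieces.get(piece, 0)
        score -= weight * opponent_pieces.get(piece, 0)
    return score

# Up a knight but down two pawns: 3 - 2*1 = +1
print(evaluate({'knight': 1}, {'pawn': 2}))  # -> 1
```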
Now let's see how we can compute the value of a state using these two innovations to work on b and m. I've modified our routine for value in two ways. One, I've introduced a new line that says if we decide to cut off the search at a particular depth, then apply the evaluation function to the state and return that. Two, I've added some bookkeeping variables: one for the current depth, which will get increased as we go along, and then two values called alpha and beta, which are the traditional names, where alpha is the best value found so far for Max along the path that we are currently exploring, and beta is the best value found so far for Min. Since we have these extra parameters, when we start out we would make the call with our initial state S0; we're currently at depth zero in the search tree; we haven't found the best for Max yet, so that would be minus infinity; and similarly we haven't found anything for Min, so that would be plus infinity. We call that, and then at each node we would choose one of these four cases. Here's the new definition of maxValue taking the depth and the alpha and beta parameters into account. It's similar to what we had before. We go through all the successors. We take the maximum, and in this case we're incrementing the depth as we call recursively for the value of each node. We get the cutoff here if we exceed beta, and otherwise we retain alpha as the maximum value for Max so far. Then we return the final value.
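Putting the two modifications together--depth cutoff plus alpha-beta bookkeeping--gives a sketch like the following. The tuple-tree encoding is assumed, and the stand-in evaluation function (the average of the leaves below a node) is purely illustrative, not the lecture's:

```python
def value(state, depth, alpha, beta, cutoff, evaluate):
    """Depth-limited alpha-beta. Numbers are terminal utilities;
    interior nodes are ('max'|'min', children) pairs."""
    if isinstance(state, (int, float)):
        return state                       # terminal: exact utility
    if depth >= cutoff:
        return evaluate(state)             # cutoff: estimate instead of search
    player, children = state
    if player == 'max':
        v = float('-inf')
        for child in children:
            v = max(v, value(child, depth + 1, alpha, beta, cutoff, evaluate))
            if v >= beta:
                return v                   # prune
            alpha = max(alpha, v)
        return v
    v = float('inf')
    for child in children:
        v = min(v, value(child, depth + 1, alpha, beta, cutoff, evaluate))
        if v <= alpha:
            return v                       # prune
        beta = min(beta, v)
    return v

def avg_of_leaves(state):
    """Toy stand-in evaluation function: average of all leaves below."""
    if isinstance(state, (int, float)):
        return state
    _, children = state
    return sum(avg_of_leaves(c) for c in children) / len(children)

tree = ('max', [('min', [3, 12, 8]),
                ('min', [2, 4, 6]),
                ('min', [14, 1, 8])])
deep = value(tree, 0, float('-inf'), float('inf'), 99, avg_of_leaves)
shallow = value(tree, 0, float('-inf'), float('inf'), 1, avg_of_leaves)
print(deep)      # cutoff never reached: the true minimax value, 3
print(shallow)   # cutoff at depth 1: an estimate (about 7.67), not 3
```

The gap between the two answers is the price of cutting off: the value becomes an estimate whose quality depends entirely on the evaluation function.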
Now we said we have three ways to reduce this exponential b to the m--reducing the branching factor b, reducing the depth of the tree m, and converting the tree to a graph. Let's see how each of those fares. First, for reducing b, we came up with this alpha-beta pruning technique. In fact, that is very effective. That takes us from a regime where we're in order b to the m to one where, if we do a good job, we can get to order b to the m/2. Now what do I mean by doing a good job? Well, we get different amounts of pruning depending on the order in which we expand each branch from a node. If we expand the good nodes first, then we get a lot of pruning, because we do a good job of getting to the cutoff points. If we expand the poor nodes first, then we don't do any pruning, because we don't get to that cutoff point until later. But if we can do well, then we get to the square root of the number of nodes. In other words, we get to search twice as deep into the search tree. That's all 100% perfect in terms of not changing the result. We'd still get the exact evaluation. We just stop doing work that we didn't have to do. Now for converting the tree to a graph, we haven't talked about that yet. In fact, it depends on the particular game, but in many games it can be very useful. In games like chess, we have opening books. That is, we look at past openings, and we just memorize those positions and what the good moves are. It doesn't matter how we get to those positions. We can get to them by multiple paths through a tree, and we can just consider it a single graph. We also have closing books, where we can memorize all the positions with five or fewer pieces and know exactly what to do. In the midgame, when there are too many positions to memorize all of them, we can still search through a graph if we want to, or if we want we can just do part of that. One thing that has proven effective in games like chess is called the killer-move heuristic.
What that says is if there's one really good move in part of a search tree, then try that same move in the sister branches of that tree. In other words, if I try making one move and I find that the opponent takes my queen, then when I try making another move from that same position, I should also check whether the opponent has that response of taking my queen. Converting from a tree to a graph also doesn't lose information. It can just help us make the search go faster. The third possibility was reducing m, the depth of the tree, by just cutting off search and going to an evaluation function. That is imperfect in that the evaluation function is an estimate of the true value of the tree and won't give you the exact value. We can get into trouble. Let me show you an example of that.
Here's a search tree for a version of Pacman in which there are only four squares. There's a little Pacman guy who can move around, and there are food dots that the Pacman can eat. Maybe someplace else in the maze there are opponents, but we're not going to worry about them right here. We're just going to consider the Pacman's actions. He has two actions--to go left or right. If he goes left, he goes over here and eats that food particle and then moves back right--that's his only move from that position. Or if he moves right, then he has two other moves. Now let's assume that we cut off the search at this depth, and we want to have an evaluation function, and the goal is for Pacman to eat all the food. The evaluation function will be the number of food particles that he's eaten so far. What I want you to do is tell me in these boxes what the evaluation should be for each of these three states.
The answer is here he's eaten 1, here he's eaten 0, and here he's eaten 1. That's fine. The problem arises when we start backing up these numbers. If these are max nodes, we've skipped the opponent's moves, which are the min nodes. We're only looking at the maxes. The max of 1 is 1, so this would also get an evaluation of 1. The max of 0 and 1 is 1, so this would also get an evaluation of 1. This final node would be the max of 1 and 1, so that's also 1. But now when we go to apply the policy, if we're in this position, using these evaluation functions, both of these moves are equally good. The Pacman might choose this one, choosing at random or choosing by some predefined ordering. Then he'd end up in this state. So far he hasn't eaten anything. But this state is just as good because he knows in two moves he can eat one particle going this way just as well as in two moves he can eat one particle going this way. Now he's in this state, but notice that this state is symmetric to this one. On his next turn, if we did another depth-two search, he might just as well go back one position. He would be stuck going back and forth between these two states, because either one of those, if you look ahead only two, is equally good. You have to look ahead one, two, three, four moves to know that one of them is better than the other. This is known as the horizon effect. The idea is that when we cut off search we're specifying a horizon beyond which the agent can't see. If a good thing or a bad thing happens beyond the horizon, we don't see that. All we see is whatever is reflected in the evaluation function. If the evaluation function is imperfect, we don't see beyond the horizon, and we can make mistakes.
There is one more thing to deal with when we talk about games--and that's chance. We want to move from purely deterministic games to stochastic games like backgammon or other games that introduce dice or other sources of random action. That means that the actions an agent takes are no longer guaranteed to have a single result. Let's see how we can deal with stochastic games by looking at our value function and modifying it to allow for this. Here we have our value function, and we're dealing with four types of nodes--one, nodes that we decide to cut off on our own, because we've reached a certain depth; second, nodes that are terminal according to the rules of the game; and third and fourth, max to move and min to move. Now I'm going to add one more type, which is a chance node. We say if the state is a chance node, then we want to return the expected value of S and carry along these bookkeeping variables. What we're saying here is if it's the point of the game where it's time to roll the dice, then we're going to roll the dice, and we're going to take the expected value of all the possible results rather than the max or the min. Here we have a schematic diagram for a stochastic game--a game with dice. We start out. The chance node, or the dice-rolling node, is first. The dice are rolled--one of six possibilities. Then the next player gets his move. In this case, we've let Min move first, and Min has various moves possible to make. For each one, there is then another roll of the dice, and then Max gets to make his move.
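This extension is often called expectiminimax. A minimal sketch under an assumed encoding, where a chance node carries explicit (probability, child) pairs:

```python
def expectiminimax(state):
    """state: a number (terminal), ('max'|'min', children), or
    ('chance', [(probability, child), ...])."""
    if isinstance(state, (int, float)):
        return state
    kind, children = state
    if kind == 'max':
        return max(expectiminimax(c) for c in children)
    if kind == 'min':
        return min(expectiminimax(c) for c in children)
    # Chance node: probability-weighted average, not a max or a min.
    return sum(p * expectiminimax(c) for p, c in children)

# A fair coin flip, then Min moves: expected value 0.5*3 + 0.5*2
tree = ('chance', [(0.5, ('min', [3, 5])),
                   (0.5, ('min', [6, 2]))])
print(expectiminimax(tree))  # -> 2.5
```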
Here's the game tree for another stochastic game. This game involves flipping a coin. The chance nodes have two results: heads or tails. Then the player Max has two possible moves, A and B, and the player Min has two possible moves, C and D. This game is too small to have any alpha-beta pruning involved, but I've listed all the terminal values for the terminal states of the game. What I want you to do is fill in the non-terminal values for the chance nodes, the max nodes, and the min nodes according to the rules of minimum value and maximum value and expected value. I should say that the probability of the coin flip is 50% heads and 50% tails.
To evaluate the game tree, we work from the bottom up. Let's start over here. This is a min node. Min chooses the minimum, which will be 1. In this position, Min would choose 2, the minimum of 2 and 4. Over here Min would choose 0, the minimum of 0 and 10. Now we have some chance nodes, so we have to take the expected value. Chance, the flip of the coin, doesn't get to choose one direction or the other. Rather, both of them are possibilities, so we just average the results, since the probabilities of heads and tails are equal. So 7 plus 1 is 8, divided by 2 is 4; and 8 plus 2 is 10, over 2 is 5, which is the expected value there. The expected value of 0 and 6 is 3, and the expected value of 0 and 4 is 2. Now we have a maximizing node. The max of 5 and 4 would be 5. The max of 3 and 2 would be 3, and finally, we have another chance node. The average of 5 and 3 would be 4, and that's the value of the game.
Now one more question for this same game tree. I want you to click on all the terminal states that are possible outcomes for this game if both players play rationally.
One more quick game tree to evaluate. Here we have terminal values. We have chance nodes where the two options are equiprobable. We have a max node with two actions, A and B. I want you to fill in the values for all the nodes and click on which action, A or B, is the rational action for Max.
The average of these two is 2. That's a chance node. The average of these is 2.5. Max will choose the better of 2 and 2.5, which is 2.5. Therefore, B will be the rational action for Max.
Now we know that in this game, if these were terminal nodes, then that would be the right action, and there would be nothing to argue about. But what if, instead of being terminal nodes, these were cutoff nodes, and these were evaluation values for those nodes? Furthermore, if it's an evaluation function, then it's an arbitrary function. Suppose that instead of coming up with these values, we used a different evaluation function that squared these values, so we came up with evaluations of 0, 16, 4, and 9. With that function, I want you to repeat the problem of filling in the values for each of these nodes and tell me what the rational action is for Max.
The answer is this is a chance node, so we take the average of 0 and 16. That's no longer 2. It becomes 8. We take the average of 9 and 4. That's no longer 2.5. It becomes 6.5. Notice what's happened: Max now chooses 8 over 6.5, and the rational action has shifted from B to A. What's gone on here? Just by making a change to the evaluation function--one that didn't even change the ordering of the values--we changed the rational action.
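This sensitivity can be checked numerically. The sketch below replays the quiz: an expectimax choice between two equiprobable chance nodes, first with the original cutoff values and then with their squares (the tree encoding is my own minimal one):

```python
def expectimax(state):
    """state: a number, or ('max'|'chance', children); chance children
    are equiprobable, as in the coin-flip games here."""
    if isinstance(state, (int, float)):
        return state
    kind, children = state
    values = [expectimax(c) for c in children]
    return max(values) if kind == 'max' else sum(values) / len(values)

def best_action(leaves_a, leaves_b):
    """Which of Max's two moves looks better, A or B?"""
    value_a = expectimax(('chance', leaves_a))
    value_b = expectimax(('chance', leaves_b))
    return 'A' if value_a > value_b else 'B'

print(best_action([0, 4], [2, 3]))    # averages 2 vs 2.5 -> 'B'
print(best_action([0, 16], [4, 9]))   # squared: 8 vs 6.5 -> 'A'
```

Minimax decisions survive any order-preserving rescaling of the evaluation function; expectimax decisions depend on the actual magnitudes, so the evaluation function has to mean something on an absolute scale, not just as a ranking.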
Let's summarize what we've done so far. We've built up this value function that tells us the value of any state, and therefore we can choose the best action in a state. We started off just having terminal states and max value states. That's good for one-player, deterministic games, and we realized that that's just the same thing as searches we've seen before, where we had A* search or depth-first search. Then we added in an opponent player for two-player or multiplayer games, who is trying to minimize rather than maximize. We saw how to do that. Then we optimized by saying at some point we may not be able to search the whole tree, so we're going to have a cutoff depth and an evaluation function. We recognized that that means we're no longer perfect in terms of evaluating the tree. We now have an estimate. We also tried to be more computationally effective by throwing in the alpha and beta parameters, which keep track of the best value so far for Max and Min and prune off branches of the tree that are outside of that range--branches that are provably not part of the answer for the best value. We kept track of those through these bookkeeping parameters. Then finally we introduced stochastic games, in which there is an element of chance or luck or rolling of the dice. We realized that in order to evaluate those nodes, we have to take the expected value rather than the minimum or the maximum value. Now we have a way to deal with all the popular types of games. The details now go into figuring out when to cut off and what the right evaluation function is. Those are a complex area. A lot of research in AI is being done there, but it's being done for specific games rather than for the theory in general.