Hi--welcome back. You just learned how Markov Decision Processes can be used to determine an optimal sequence of actions for an agent in a stochastic environment. That is, an agent that knows the correct model of the environment can navigate, finding its way to the positive rewards and avoiding the negative penalties. But it can only do that if it knows where the rewards and penalties are. In this lesson, we'll see how a technique called reinforcement learning can guide the agent to an optimal policy, even though it doesn't know anything about the rewards when it starts out.
For example, in the 4 by 3 GridWorld, what if we don't know where the plus 1 and minus 1 rewards are when we start out? A reinforcement learning agent can learn to explore the territory, find where the rewards are, and then learn an optimal policy. Whereas an MDP solver can only do that once it knows exactly where the rewards are. Now, this idea of wandering around and then finding a plus 1 or a minus 1 is analogous to many forms of games, such as backgammon--and here's an example: backgammon is a stochastic game; and at the end, you either win or lose. And in the 1990s, Gerald Tesauro at IBM wrote a program to play backgammon. His first attempt tried to learn the utility of a game state, U of S, using examples that were labelled by human expert backgammon players. But this was tedious work for the experts, so only a small number of states were labelled. The program tried to generalize from that, using supervised learning, and was not able to perform very well. So Tesauro's second attempt used no human expertise and no supervision. Instead, he had 1 copy of his program play against another; and at the end of the game, the winner got a positive reward, and the loser, a negative. So he used reinforcement learning; he backed up that knowledge through the game states, and he was able to arrive at a function that had no input from human expert players, but, still, was able to perform at the level of the very best players in the world. He was able to do this after learning from examples of about 200,000 games. Now, that may seem like a lot--but it really only covers about 1 trillionth of the total state space of backgammon. Now, here's another example: This is a remote controlled helicopter that Professor Andrew Ng at Stanford trained, using reinforcement learning; and the helicopter--oh--oh, sorry--I made a mistake--I put this picture upside down because--really, Ng trained the helicopter to be able to fly fancy maneuvers--like flying upside down.
And he did that by looking at only a few hours of training data from expert helicopter pilots who would take over the remote controls, pilot the helicopter--and those would all be recorded--and then you would get rewards when it did something good, or when it did something bad; and Ng was able to use reinforcement learning to build an automated helicopter pilot, just from those training examples. And that automated pilot, too, can perform tricks that only a handful of humans are capable of performing. But enough of this still picture--let's watch a video of Ng's helicopters in action. [Stanford University Autonomous Helicopter] [sound of helicopter flying] [Chaos]
Let's stop and review the 3 main forms of learning. We have supervised learning, in which the training set is a bunch of input/output pairs--X1, Y1; X2, Y2; et cetera--in which we try to produce a function: y equals f of x--and so the learning is producing this function, f. Then we have unsupervised learning, in which we're given just a set of data points--X1, X2, and so on--and each of these points, maybe, has many dimensions, many features. And what we're trying to learn is some patterns in that--some clusters of these data--or you could just say what we're trying to learn is a probability distribution: what's the probability that this random variable will have particular values; and learn something interesting from that. In this lesson, we're introducing the third type of learning--reinforcement learning--in which we have a sequence of action and state transitions. So: state and action, state and action--and so on. And at some point, we have some rewards associated with these. So there's a reward, and maybe not a reward for this state; and then another reward for this state--and the rewards are just scalar numbers, positive or negative numbers. What we're trying to learn here is an optimal policy: what's the right thing to do in any of the states?
Let's show some examples of machine learning problems, and I want you to tell me, for each one, whether it's best addressed with supervised learning, unsupervised learning, or reinforcement learning. And the first example is speech recognition--where I have examples of voice recordings, and then the transcripts into text for each of those recordings; and from them, I try to learn a model of language. Is that supervised, unsupervised or reinforcement? Next example is analyzing the spectral emissions of stars and trying to find clusters of stars of similar types that may be of interest to astronomers. Would that be supervised, unsupervised or reinforcement? The data here would just consist of: for each star, a list of all the different emission frequencies of light coming to earth. Next example is lever pressing. So--I have a rat who is trained to press a lever to get a release of food when certain conditions are met. Is that supervised, unsupervised or reinforcement learning? And finally, the problem of an elevator controller. Say I have a bank of elevators in a building, and they have to have some program--some policy--to decide which elevator goes up and which elevator goes down in response to the percepts, which would be the button presses at various floors in the building. And so, I have a sequence of button presses, and I have the wait time that I am trying to minimize--so after each button press, the elevator moves; the person waiting is waiting for a certain amount of time, and then gets picked up, and the algorithm is given that amount of wait time. Would that be supervised, unsupervised or reinforcement?
The answers are that speech recognition can be handled quite well by supervised learning. That is, we have input/output pairs; the input is the speech signal, and the output is the words that it corresponds to. Analyzing the spectral emissions of stars is an example of unsupervised clustering, where we're taking the input data--we have data for each star, but we don't have any label associated with it. Rather, we're trying to make up labels by clustering them together, giving them to scientists, and then letting the scientists see: Do these clusters make any sense? Lever pressing is a classic example of reinforcement learning. In fact, the term "reinforcement learning" was used for a long time in animal psychology, before it was used in computer science. And elevator control is another area that has been investigated using reinforcement learning; and, in fact, very good algorithms--better than the previous state of the art--have been made, using reinforcement learning techniques. So the input, again, is a set of state/action transitions; and then the reinforcement is--in this case, it's always a negative number, because there's always a wait time. And so that's the penalty--but all we get is the amount of wait time that we're trying to minimize.
Now, before we get into the math of reinforcement learning, let's review MDPs--which are, of course, Markov Decision Processes. An MDP consists of a set of states--s is an element of the set of states, S; a set of actions--a is an element of the set of actions, A of S, that are available in each state, s. And we're going to distinguish a start state, which we'll call S-zero; and then we need a transition function that says: How does the world evolve as we take actions in the world? And we can denote that by the probability that we get a result state, S prime--given that we start in state, S, and apply action, A. That's a probability distribution because the world is stochastic. The same result doesn't happen every time when we do the same action, so we have this probability distribution. In some notations, you'll see T of S, A, S prime for the transition function. And then, in addition to the transition, we need a reward function--which we'll denote R. Sometimes that's over the whole triplet--the reward that you get from starting in one state, taking an action, and arriving at another state; sometimes we only need to talk about the result state. So in the 4 by 3 GridWorld, for example, we don't care how you got to this state--it's just, when you get to one of the states in the upper right, you get a plus 1 or minus 1 reward. And similarly, in a game like backgammon--when you win or lose, you get a positive or negative reward. It doesn't matter what move you took to win or lose. And so that's all there is to MDPs.
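To make these pieces concrete, here is a minimal sketch in Python of one way the MDP ingredients might be encoded. The 0.8/0.1/0.1 noise model and the -0.04 living cost are the standard 4 by 3 GridWorld conventions; the dictionary-based encoding and the coordinate scheme (with the wall at (1,1) and terminals at (3,2) and (3,1)) are just illustrative choices, not part of the lecture.

```python
# A tiny sketch of an MDP's ingredients: a stochastic transition model
# P(s' | s, a) and a reward function R(s) for the 4x3 GridWorld.
# The 0.8 / 0.1 / 0.1 split is the usual noise model: the intended move
# succeeds 80% of the time; otherwise you slip to one side or the other.

def move(s, a):
    """Deterministic one-step move on the 4x3 grid, blocked by walls."""
    x, y = s
    dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[a]
    nx, ny = x + dx, y + dy
    if 0 <= nx < 4 and 0 <= ny < 3 and (nx, ny) != (1, 1):  # (1,1) is the wall
        return (nx, ny)
    return s  # bumping into a wall or edge leaves you where you were

def transition(s, a):
    """Return a list of (probability, next_state) pairs for doing a in s."""
    left, right = {"N": ("W", "E"), "S": ("E", "W"),
                   "E": ("N", "S"), "W": ("S", "N")}[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

def reward(s):
    """+1 and -1 terminal squares; a small living cost everywhere else."""
    return {(3, 2): +1.0, (3, 1): -1.0}.get(s, -0.04)
```

Note that the transition model is a full probability distribution over result states, which is exactly the piece a reinforcement learning agent may not have when it starts out.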
Now to solve an MDP, we're trying to find a policy--pi of S--that's going to be our answer. The pi that we want--the optimal policy--is the one that's going to maximize the discounted, total reward. So what we mean is: we want to take the sum, over all times into the future, of the reward that you get from starting out in the state that you're in at time T, then applying the policy to that state, and arriving at a new state at time T plus 1. And so we want to maximize that sum--but the sum might be infinite, and so what we do is take this value, gamma, and raise it to the T power, saying we're going to count future rewards less than current rewards--and that way, we'll make sure that the sum total is bounded. So we want the policy that maximizes that result. If we figure out the utility of the state by solving the Markov Decision Process, then we have: the utility of any state, S, is equal to the maximum, over all possible actions that we could take in S, of the expected value of taking that action. And what's the expected value? Well, it's just the sum, over all resulting states, of the transition model--the probability that we get to that state, given that from the start state we take the action specified by the optimal policy--times the utility of that resulting state. So--look at all possible actions; choose the best one, according to the expected utility, weighted by the transition probabilities.
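The Bellman relationship just described can be turned into a simple iterative solver. Here is a hedged sketch on a deliberately tiny two-state chain (the states, rewards, discount, and the single action "go" are all made up for illustration), repeatedly applying the update U(s) = R(s) + gamma * max over a of the expected utility of the result.

```python
# Value iteration on a made-up two-state chain, just to illustrate the
# Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s').
# From state "a", the one action "go" reaches "terminal" with
# probability 0.9 and stays put with probability 0.1.

GAMMA = 0.9
R = {"a": 0.0, "terminal": 1.0}
P = {("a", "go"): [(0.9, "terminal"), (0.1, "a")]}

def value_iteration(iterations=100):
    U = {"a": 0.0, "terminal": 0.0}
    for _ in range(iterations):
        new_U = dict(U)
        new_U["terminal"] = R["terminal"]   # terminal state: just its reward
        new_U["a"] = R["a"] + GAMMA * max(
            sum(p * U[s2] for p, s2 in P[("a", act)])
            for act in ["go"])              # max over the available actions
        U = new_U
    return U
```

Because the update is a contraction, the utilities converge to a fixed point; here U("a") settles at 0.81/0.91, which you can verify by plugging it back into the update.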
Now here's where reinforcement learning comes into play: What if you don't know R--the reward function? What if you don't even know P--the transition model of the world? Then you can't solve the Markov Decision Process, because you don't have what you need to solve it. However, with reinforcement learning, you can learn R and P by interacting with the world, or you can learn substitutes that will tell you as much as you need to know, so that you never actually have to compute with R and P. What you learn, exactly, depends on what you already know and what you want to do. So we have several choices. One choice is we can build a utility-based agent. So we're going to list agent types, based on what we know, what we want to learn, and what we then use once we've learned. So for a utility-based agent, if we already know P, the transition model, but we don't know R, the reward function, then we can learn R--and use that, along with P, to learn our utility function; and then go ahead and use the utility function just as we did in normal Markov Decision Processes. So that's one agent design. Another design that we'll see in this lesson is called a Q-learning agent. In this one, we don't have to know P or R; and we learn a value function, which is usually denoted by Q. And that's a type of utility but, rather than being a utility over states, it's a utility of state-action pairs--and that tells us: For any given state and any given action, what's the utility of that result--without knowing the utilities and rewards, individually? And then we can just use that Q directly. So we don't actually have to ever learn the transition model, P, with a Q-learning agent. And finally, we can have a reflex agent where, again, we don't need to know P and R to begin with; and we learn the policy, pi of S, directly; and then we just go ahead and apply pi. So it's called a reflex agent because it's pure stimulus-response: I'm in a certain state, I take a certain action.
I don't have to think about modeling the world, in terms of: What are the transitions--where am I going to go next? I just go ahead and take that action.
Now, the next choice we have in agent design revolves around how adventurous the agent wants to be. One possibility is what's called the passive reinforcement learning agent--and that can be any of these agent designs, but what passive means is that the agent has a fixed policy and executes that policy. But it learns about the reward function, R, and maybe the transition function, P, if it didn't already know that. It learns that while executing the fixed policy. So let me give you an example. Imagine that you're on a ship in uncharted waters and the captain has a policy for piloting the ship. You can't change the captain's policy. He or she is going to execute that, no matter what. But it's your job to learn all you can about the uncharted waters. In other words, learn the reward function, given the actions and the state transitions that the ship is going through. You learn, and remember what you've learned, but that doesn't change the captain's policy--and that's passive learning. Now, the alternative is called active reinforcement learning--and that's where we change the policy as we go. So let's say, eventually, you've done such a great job of learning about the uncharted waters that the captain says to you, "Okay--I'm going to hand over control; and as you learn, I'm going to allow you to change the policy for this ship. You can make decisions of where we're going to go next." And that's good, because you can start to cash in early on your learning; and it's also good because it gives you a possibility to explore. Rather than just say: What's the best action I can do right now?--you can say: What's the action that might allow me to learn something--to allow me to do better in the future?
Let's start by looking at passive reinforcement learning. I'm going to describe an algorithm called Temporal Difference Learning--or TD. And what that means--sounds like a fancy name, but all it really means is we're going to move from one state to the next; and we're going to look at the difference between the 2 states, and learn that--and then kind of back up the values from one state to the next. So we're going to follow a fixed policy, pi--and let's say our policy tells us to go this way, and then go this way. We'll eventually learn that we get a plus 1 reward there, and we'll start feeding back that plus 1, saying: if it was good to get a plus 1 here, it must be somewhat good to be in this state, somewhat good to be in this state--and so on, back to the start state. So, in order to run this algorithm, we're going to try to build up a table of utilities for each state; and along the way, we're going to keep track of the number of times that we visited each state. Now, the table of utilities, we're going to start blank--we're not going to start them at zero or anything else; they're just going to be undefined. And the table of numbers, we're going to start at zero, saying we visited each state a total of zero times. What we're going to do is run the policy, have a trial that goes through the states; when it gets to a terminal state, we start it over again at the start and run it again; and we keep track of how many times we visited each state, we update the utilities, and we get a better and better estimate for the utility. And this is what the inner loop of the algorithm looks like--and let's see if we can trace it out. So we'll start at a start state, we'll apply the policy--and let's say the policy tells us to move in this direction.
Then we get a reward here, which is zero; and then we look at it with the algorithm, and the algorithm tells us: if the state is new--yes, it is; we've never been there before--then set the utility of that state to the new reward, which is zero. Okay--so now we have a zero here; and then let's say, the next step, we move up here. So, again, we have a zero; and let's say our policy looks like a good one, so we get: here, we have a zero. We get: here, we have a zero. We get: here--now, this state, we get a reward of 1, so that state gets a utility of 1. And all along the way, we have to think about how we're backing up these values, as well. So when we get here, we have to look at this formula to say: How are we going to update the utility of the prior state? And the difference between this state and this state is zero. So this difference, here, is going to be zero--the reward is zero, and so there's going to be no update to this state. But now, finally--for the first time--we're going to have an actual update. So we're going to update this state to be plus 1, and now we're going to think about changing this state. And what was its old utility?--well, it was zero. And then there's a factor called alpha, which is the learning rate that tells us how much we want to move this utility towards something that's maybe a better estimate. And the learning rate should be such that, if we are brand new, we want to move a big step; and if we've seen this state a lot of times, we're pretty confident of our number and we want to make a small step. So let's say that the alpha function is 1 over N plus 1. Well, we'd better not make it just 1 over N, since N might be zero. So 1 over N plus 1 would be ½; and then the reward in this state was zero; plus, we had a gamma--and let's just say that gamma is 1, so there's no discounting; and then we look at the difference between the utility of the resulting state--which is 1--minus the utility of this state, which was zero.
So we get ½, 1 minus zero--which is ½. So we update this; and we change this zero to ½. Now let's say we start all over again and let's say our policy is right on track; and nothing unusual, stochastically, has happened. So we follow the same path, we don't update--because they're all zeros all along this path. We go here, here, here; and now it's time for an update. So now, we've transitioned from a zero to ½-- so how are we going to update this state? Well, the old state was zero and now we have a 1 over N plus 1-- so let's say 1/3. So we're getting a little bit more confident--because we've been there twice, rather than just once. The reward in this state was zero, and then we have to look at the difference between these 2 states. That's where we get the name, Temporal Difference; and so, we have ½ minus zero-- and so that's 1/3 times ½-- so that's 1/6. Now we update this state. It was zero; now it becomes 1/6. And you can see how the results of the positive 1 starts to propagate backwards--but it propagates slowly. We have to have 1 trial at a time to get that to propagate backwards. Now, how about the update from this state to this state? Now, we were ½ here--so our old utility was ½; plus Alpha--the learning rate--is 1/3. The reward in the old state was zero; plus the difference between these two, which is 1 minus ½. So that's ½ plus 1/6 is 2/3. And now the second time through, we've updated the utility of this state from 1/2 to 2/3. And we keep on going--and you can see the results of the positive, propagating backwards. And if we did more examples through here, you would see the results of the negative propagating backwards. And eventually, it converges to the correct utilities for this policy.
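The arithmetic just worked through comes straight from the TD update rule. Here is a minimal sketch (with gamma = 1 and alpha = 1/(N + 1), exactly as in the walkthrough) that reproduces the ½, 1/6, and 2/3 values.

```python
# Passive TD update: U(s) <- U(s) + alpha(N) * (r + gamma * U(s') - U(s)),
# where N is the number of times state s has been visited.
# With gamma = 1 and alpha = 1 / (N + 1), this reproduces the numbers
# worked out above for the states just before the +1 terminal.

def td_update(u_s, r, u_next, n, gamma=1.0):
    alpha = 1.0 / (n + 1)            # learning rate shrinks with visit count
    return u_s + alpha * (r + gamma * u_next - u_s)

# First trial: the pre-terminal state (N = 1) backs up from the +1 terminal.
u = td_update(0.0, 0.0, 1.0, 1)      # 0 + 1/2 * (0 + 1 - 0) = 1/2
# Second trial: the state before it (N = 2) backs up from the 1/2 ...
v = td_update(0.0, 0.0, 0.5, 2)      # 0 + 1/3 * (0 + 1/2 - 0) = 1/6
# ... and the pre-terminal state itself moves from 1/2 toward the +1.
u = td_update(u, 0.0, 1.0, 2)        # 1/2 + 1/3 * (0 + 1 - 1/2) = 2/3
```

Running the same trajectory over and over is what makes the +1 propagate backwards one state per trial, just as described above.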
Now here are some results from running the passive TD algorithm on the 4 by 3 maze. On the right, we see a graph of the average error in the utility function--averaged across all the states. So it starts off--for the first 5 or so trials, the error rate is very high--it's off the charts. But then it starts to settle down, through 10, 20, 40; and up to about 60 or so, it's still improving; and then it gets to a final steady state, after about 60 trials, of about .05 in the average error in utility. So that's not too bad, but it's not really converging all the way down to zero error. And on the left, you see the utility estimates for various different states; and, as we see--as we get out to 500 trials, they're starting to converge a little bit, close to their true values. But we see in the first 100 or so trials--they were all over the map, and so it wasn't doing very well. It took awhile for it to converge to something close to the true values.
Now I want to do a little quiz, and ask you: True or False, which of the following are possible weaknesses in this TD learning with a passive approach to reinforcement learning? One: Is it possible that we would have a long convergence time--that it might take a long time to converge to the correct utility values? Secondly, are we limited by the policy that we choose? So remember: in passive reinforcement learning, we choose a fixed policy and execute that policy; and any deviance from the policy results from the stochasticity. We may visit different squares because the environment is stochastic, but not because we made different choices. So there's that limitation. Third, can there be a problem with missing states? That is, could there be some states that have a zero count--that we never visited, and never got a utility estimate? And fourth, could there be a problem with a poor estimate for certain states? So could it be that, even though a state didn't have a count of zero, it had a low count, and we weren't able to get a good utility estimate for that state?
The answer is that every one of these is a potential problem for passive reinforcement learning. Now, every problem won't show up in every possible domain. It'll depend on what the environment looks like. But it is a possibility that you could get bitten by any of these problems. And they all stem from the same cause: from the fact that passive learning stubbornly sticks to the same policy throughout. We have a policy, pi of S, and we always execute that policy. So if the policy here was to go up and then go right, then we would always stick to that; and the only time we would explore any other state is when those actions failed. If we tried to go up from this state--because that's what the policy said; but, stochastically, we slipped over to this state--then we wouldn't do something else, according to the policy; and so we'd get a little bit of exploration, but we'd only vary from the chosen path because of that variation, and we wouldn't intentionally explore enough of the space.
So let's move on to Active Reinforcement Learning and, in particular, let's examine a simple approach called a Greedy Reinforcement Learner. And the way that works is it uses the same passive TD learning algorithm that we talked about, but, after each time we update the utilities--or maybe after a couple of updates; you can decide how often you want to do it--after the change to the utilities, we recompute the new optimal policy, pi. So we throw away our old pi, pi1, and replace it with a new pi, pi2--which is the result of solving the MDP described by our new estimates of the utilities. Now we have a new policy, and we continue learning with that new policy. And so, if the initial policy was flawed, the Greedy algorithm would tend to move away from the initial policy, towards a better policy--and we can show how well that works.
Here's the result of running the Greedy agent over 500 trials. And I've graphed 2 things here: One is the error; and you see, over the top--over the first 40 or so trials--the error was very high--way up here. But then, suddenly, it jumped down to a lower level, and stayed along that level all the way through to 500. I've also graphed, with a dotted line, the policy loss. What does that mean?--so that's the difference between the policy that the agent has learned and the optimal policy. So if it had learned the optimal policy, the policy loss would be zero, down here. It doesn't quite get to zero. It was high, up here, and then at around step 40, it learned something important. What did it learn?--well, here's the final policy that it came up with. Maybe it started out originally going in this direction and hitting the minus 1; and then it flipped and learned a new policy that went in a better direction. But it still hasn't learned the optimal policy. And we can see--for example, this looks like a mistake here. In state 1-2, its policy is moving down and then following this path, which it learned, towards the goal. But really, a better route would be to take the northern route, and go through this path. But it hasn't learned that. Because it was Greedy, it found something that seemed to be doing good for it, and then it never deviated from that.
So the question, then, is: How do we get this learner out of its rut? It improved its policy for awhile, but then it got stuck in this policy where we go here, go up, and then go right. Most of the time, that's a perfectly good policy. But if a stochastic error makes us slip into the minus 1, then it hurts us. We'd like to be able to say we're going to stop doing that and somehow find this route. But in order to find that new route, we'd have to spend some time executing a policy which was not the best policy known to us. In other words, we'd have to stop exploiting the best policy we'd found so far--which is this one--and start exploring, to see if maybe there's a better policy. And exploring could lead us astray and cause us to waste a lot of time. So we have to figure out: what's the right trade-off? When is it worth exploring to try to find something better for the long term--even though we know that exploring is going to hurt us in the short term? Now, one possibility is, certainly, random exploration. That is, we can follow our best policy some percentage of the time, and then randomly, at some point, we can decide to take an action which is not the optimal action. So we're here; the optimal action would be to go east; and we say, "Well, this time we're going to choose something else--let's try going north." And then we explore from there and see if we've learned something. So that policy does, in fact, work--randomly making moves with some probability--but it tends to be slow to converge. In order to get something better, we have to really understand what's going on with our exploration, versus exploitation.
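This random-exploration idea is commonly called an epsilon-greedy policy: with some small probability, take a random action; otherwise exploit the best-known action. A minimal sketch, assuming we already have Q-style utility estimates in a dictionary (the epsilon value and the state/action names are illustrative, not from the lecture):

```python
import random

def epsilon_greedy(state, q, actions, epsilon=0.1):
    """Follow the best-known action most of the time, but with
    probability epsilon pick a random action to explore instead.

    q maps (state, action) pairs to estimated utilities; unseen
    pairs default to 0."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))      # exploit
```

As the lecture notes, this does converge, but slowly: the random moves are not directed at the states where we are most uncertain, which motivates the smarter exploration agent below.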
So let's really think about what we're doing when we're executing the active TD learning algorithm. First, we're keeping track of the optimal policy we've found so far; and that gets updated as we go, and replaced with new policies. Secondly, we're keeping track of the utilities of states--and those, too, get updated as we go along. And third, we're keeping track of the number of times that we visited each state. And that gets incremented on each trial. Now, what could happen? What could go wrong? There are really 2 reasons why our utility estimates could be off. First, we haven't sampled enough. The N values are too low for that state, and the utilities that we got were just some random fluctuations and weren't a very good, true estimate. And secondly, we could get a bad utility because our policy was off. The policy was telling us to do something that wasn't really the best thing, and so the utility wasn't as high as it could be. So let's do a little quiz. I want you to tell me, for the 2 sources of possible error--too little sampling and wrong policy--is it True or False, each of these statements: One: Could the error--either the sampling error or the policy error--make the utility estimates too low? And secondly, could it make the utility too high? And third, could it be improved with higher N values--that is, more trials?
And here are the answers: For the error introduced by a lack of enough sampling, all these statements are true. If you don't have enough samples, it might make the utility too high; it might make the utility too low--and it could certainly be improved by taking more trials. But with the differences due to having not quite the right policy, the answers aren't the same. So yes, if you don't have the right policy, that could make the utilities too low--if you're doing something silly, like starting in this state and the policy says, "Drive straight into the minus 1," that could make the utility of this state lower than it really should be. But it can't make the utility too high. So we really have a bound on the utility here. The bound is: what does the optimal policy do? And no matter what policy we have, it's not going to be better than the optimal policy; and so we can only be making things worse with our policy, not making them better. And finally, having a higher N won't necessarily improve things. It will decrease the variance, but it won't improve the mean.
Now what that suggests is the design for an exploration agent that will be more proactive about exploring the world when it's uncertain, and will fall back to exploiting the optimal policy--or whatever policy it has that's close to optimal--when it becomes more certain about the world. And what we can do is go through this normal cycle of TD learning--like we always did. But when we're looking for the estimate of the utility of the state, what we can do is say: The utility estimate for the state will be some large value, R-plus--say, plus 1, in the case of this example: the largest reward we can expect to get--in every case when the number of visits to the state is less than some threshold, e, the exploration threshold. And when we've visited a state e times, then we revert to the learned utilities. So when we start out, we're going to explore from new states; and once we have a good estimate of what the true utility of the state actually is, then we stop exploring and we go with those utilities.
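This optimistic-utility idea can be written as a small exploration function: while a state has been visited fewer than e times, pretend its utility is the best reward obtainable (R-plus, here +1); after that, trust the learned estimate. A sketch, with illustrative parameter values:

```python
def exploration_utility(u, n, s, r_plus=1.0, e=5):
    """Optimistic utility estimate for state s.

    u: dict of learned utilities per state,
    n: dict of visit counts per state,
    r_plus: the largest reward we expect to see (+1 in the 4x3 world),
    e: the exploration threshold on visit counts."""
    if n.get(s, 0) < e:
        return r_plus        # under-explored: optimistically assume the best
    return u[s]              # well-explored: trust the learned estimate
```

Because under-visited states look maximally attractive, the greedy policy computed from these optimistic utilities is naturally drawn toward them, which is exactly the proactive exploration described above.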
And here we have the result of some simulations of the exploratory agent. We see it's doing much better than the passive agent or the Greedy agent. So I'm graphing here; and we only had to go through 100 trials. We didn't have to go through 500--so it's converging much faster. And it's converging to much better results. So the policy loss, in the dotted line, started off high; but after only 20 trials, it's come down to perfect. So it learned the exact, correct policy after 20 trials. As for the error in the utilities--you can have the perfect policy while not quite having the right utilities for each state--that comes down, too, to a level that's lower than the previous agent's, but still not quite perfect. And we see here that it does, in fact, learn the correct policy.
Now, let's say we've done all this learning, we've applied our agent, and we've come up with a utility model; and we have the estimates for the utility of every state. Now what do we do when we want to act in the world? Well, we now have our policy for the state, which is determined by the expected value; and we compute the expected value of each state by looking at the utility, which we just learned. But then we have to multiply by the transition probabilities: What's the probability of each resulting state that we have to look up the utility of? And so we need to know that--and in some cases, we're given the transition model, and so we know all these probabilities. But in other cases, we don't have it; and if we haven't learned it, we can't apply our policy, even though we know the utilities. I want to talk, briefly, about this alternative method called Q-learning, that I mentioned before. In Q-learning, we don't learn U directly, and we don't need the transition model. Instead, what we learn is a direct mapping, Q, from states and actions to utilities; and then, once we've learned Q, we can determine the optimal policy for the state just by taking the maximum, over all possible actions, of these Q of S, A values.
Now, how do we do Q-learning? Well, we start off with this table of Q values--and notice that there are more entries in this table than there were in the utility table. For each state, I've divided it up into the different actions--so here's the action of going north, south, east or west from this particular state. They all start out with utility--or rather, Q utility--at zero. But as we go, we start to update; and we have an update formula that's very similar to the formula for TD learning. It has the same learning rate, alpha, and the same discount factor, gamma; and we just start applying that. So we start tracking through the state space; and when we get a transition--say we go east from here, and then east, and then north, and then north, and then east--then we would back up this value; and depending on what the values of alpha and gamma were, we might update this to .6 or something; and the next time through, we might update that to .7, and update this one to .4, and so on. In each case, we'd be updating only the action we took, associated with that state--not the whole state. We'd keep repeating that process until we had values filled in for all the state-action pairs.
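The update loop described above can be sketched in a few lines. The tiny `Corridor` environment below (states 0 through 3, with a +1 reward for reaching state 3) is an illustrative assumption standing in for the 4 by 3 GridWorld, and the epsilon-greedy exploration rule is one simple choice of exploration policy.

```python
import random
from collections import defaultdict

# A minimal tabular Q-learning sketch. The Corridor environment is an
# illustrative assumption, not the lecture's GridWorld.

class Corridor:
    """States 0,1,2,3; actions 'left'/'right'; episode ends at state 3 (+1)."""
    def reset(self):
        return 0
    def actions(self, s):
        return ['left', 'right']
    def step(self, s, a):
        s2 = min(s + 1, 3) if a == 'right' else max(s - 1, 0)
        reward = 1.0 if s2 == 3 else 0.0
        return s2, reward, s2 == 3

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)], all start at 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit, occasionally explore
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            # TD-style backup: only the (s, a) pair actually taken is updated
            target = r + gamma * max(Q[(s2, act)] for act in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

random.seed(0)  # for reproducibility of this illustrative run
Q = q_learning(Corridor())
# extract the policy by taking the max over actions at each state
policy = {s: max(['left', 'right'], key=lambda a: Q[(s, a)]) for s in range(3)}
```

Note that each step updates only `Q[(s, a)]` for the action actually taken, just as described: values back up from the reward one transition at a time.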
Now, in some sense, you've learned all you need to know about reinforcement learning. Yes, it's a huge field, and there are a lot of other details that we haven't covered, but you've seen all the basics. The theory is there, and it works. But in another sense, we haven't gone very far, because what we've done works for these small 4 by 3 GridWorlds, but it won't work very well for larger problems--dealing with flying helicopters or playing backgammon--because there are just too many states, and we can't visit every one of them and build up the correct utility values, or Q values, for all the billions or trillions or quadrillions of states we would need to represent. So let's go back to a simpler type of example. Here's a state in a Pacman game, and we can see that this is a bad state, where Pacman is surrounded by 2 bad guys, and there's no place for him to escape. And so reinforcement learning could quickly learn that this is bad. But the problem is that that state has no relation whatsoever to this state, where, conceptually, it's the same problem--Pacman is stuck in a corner, and there are bad guys on either side of him. But in terms of the concrete state, the 2 are completely different. So what we want to be able to do is find some generalization, so that these 2 states look the same, and what I learn for this state--that learning can transfer over into this state. And so, just as we did in supervised machine learning, where we wanted to take similar points in the state space and be able to reason about them together, we want to be able to do the same thing for reinforcement learning. And we can use the same type of approach.
So we can represent a state not by an exhaustive listing of everything that's true in the state--every single dot, and so on--but rather by a collection of important features. So we can say that a state is this collection of Feature 1, Feature 2, and so on. And what are the features? Well, they don't have to be the exact position of every piece on the board. They could be things like the distance to the nearest ghost--or maybe the square of the distance, or the inverse square; or the distance to a dot or food; or the number of ghosts remaining. And then we can represent the utility of a state--or, let's go with a Q value of a state-action pair--and represent that as the sum, over some set of weights, of each weight times the value of each feature. And then our task is to learn good values for these weights--how important is each feature, whether they're positive or negative, and so on. This formulation will be good to the extent that similar states have the same value. So if these 2 states have the same value, that would be good, because we could learn that, in both cases, Pacman is trapped. It would be bad to the extent that dissimilar states have the same value--say, if we're ignoring something important. So, for example, if one of the features was: is Pacman in a tunnel?--it would probably be important to know: is that tunnel a dead end or not? And if we represented all tunnels the same, we'd probably be making a mistake. Now, the great thing is that we can make a small modification to our Q-learning algorithm where, when we were updating, the Q(s, a) got updated in terms of a small change to the existing Q(s, a) values--we can do the same thing with the w-sub-i values. We can update them as we make each change to the Q values. And they're both driven by the amount of error: if the Q values are off by a lot, we have to make a big change; if they're not, we make a small change--and the same thing with the w-sub-i values.
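The feature-based representation and its weight update can be sketched as follows. The feature values themselves (distance to the nearest ghost, and so on) are assumed to come from some function of the state and action; the numbers in the usage example are made up for illustration.

```python
# A sketch of approximate Q-learning with a linear feature representation:
#   Q(s, a) = sum_i  w_i * f_i(s, a)
# Feature extraction is assumed to happen elsewhere; we work with the
# resulting list of feature values directly.

def q_value(weights, features):
    """Q(s,a) as a weighted sum of the feature values f_i(s,a)."""
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, reward, max_next_q, q_sa,
                   alpha=0.1, gamma=0.9):
    """One update after a transition: each w_i moves in proportion to the
    error times its own feature value, mirroring the tabular Q update."""
    error = (reward + gamma * max_next_q) - q_sa
    return [w + alpha * error * f for w, f in zip(weights, features)]

# Illustrative transition: two features, weights starting at zero,
# a reward of 1 observed, and no future value yet.
w = [0.0, 0.0]
f = [1.0, 2.0]
w = update_weights(w, f, reward=1.0, max_next_q=0.0, q_sa=q_value(w, f))
```

Because every state with the same feature values shares the same weights, learning in one "trapped in a corner" state transfers to every other state that looks the same through the features.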
And that looks just like what we did when we used supervised machine learning to update our weights. So we can apply that same process, even though it's not supervised. It's as if we're bringing our own supervision to reinforcement learning.
In summary, then, we've learned how to do a lot with MDPs-- especially using reinforcement learning. If we don't know what the MDP is, we know how to estimate it and then solve it. We can estimate the utility for some fixed policy, pi; or we could estimate the Q values for the optimal policy while executing an exploration policy. And we saw something about how we can make the right trade-offs between exploration and exploitation. So reinforcement learning remains one of the most exciting areas of AI. Some of the biggest surprises have come out of reinforcement learning-- things like Tesauro's backgammon player or Andrew Ng's helicopter; and we think that there's a lot more that we can learn. It's an exciting field, and one where there's plenty of room for new innovation.