So today is an exciting day. We'll talk about planning under uncertainty, and it really brings together material we've talked about in past classes. We talked about planning, but not under uncertainty, and you've had many, many classes on uncertainty, and now we get to the point where we can make decisions under uncertainty. This is really important for my own research field, robotics, where the world is full of uncertainty, and the type of techniques I'll tell you about today will really make it possible to drive robots in actual physical environments and find good plans for these robots to execute.
[Narrator] Planning under uncertainty. In this class so far we talked a good deal about planning. We talked about uncertainty and probabilities, and we also talked about learning, but all 3 items were discussed separately. We never brought planning and uncertainty together, uncertainty and learning, or planning and learning. So in the class today, we'll fuse planning and uncertainty using techniques known as Markov decision processes, or MDPs, and partially observable Markov decision processes, or POMDPs. We also have a class coming up on reinforcement learning, which combines all three of these aspects: planning, uncertainty, and machine learning. You might remember that in the very first class we distinguished very different characteristics of agent tasks, and here are some of those. We distinguished deterministic versus stochastic environments, and we also talked about fully versus partially observable environments. In the area of planning, so far all of our algorithms fall into the deterministic, fully observable field over here, like A*, depth first, breadth first and so on. The MDP algorithms, which I will talk about first, fall into the intersection of fully observable yet stochastic, and just to remind us what the difference was: stochastic is an environment where the outcome of an action is somewhat random, whereas an environment is deterministic where the outcome of an action is predictable and always the same. An environment is fully observable if you can see the state of the environment, which means you can make all decisions based on the momentary sensory input. Whereas if you need memory, it's partially observable. Planning in the partially observable case is called POMDP, and towards the end of this class, I'll briefly talk about POMDPs, but not in any depth. So most of this class focuses on Markov decision processes as opposed to partially observable Markov decision processes. So what is a Markov decision process? One way you can specify a Markov decision process is by a graph. 
Suppose you have states S1, S2, and S3, and you have actions A1 and A2. A state transition graph like this is a finite state machine, and it becomes Markov if the outcomes of actions are somewhat random. So for example, A1 over here, with a 50% probability, leads to state S2, but with another 50% probability leads to state S3. So put differently, a Markov decision process consists of states, actions, and a state transition matrix, often written in the following form, which is just about the same as the conditional state transition probability P(s' | a, s) that s' is the correct posterior state after executing action A in state S. The missing thing is the objective for the Markov decision process. What do we want to achieve? For that we often define a reward function, and for the sake of this lecture, I will attach rewards just to states. So each state will have a function R attached that tells me how good the state is. So for example, it might be worth $10 to be in the state over here, $0 to be in the state over here, and $100 to be in the state over here. So the planning problem is now the problem of assigning an action to each possible state so that we maximize our total reward.
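To make these pieces concrete, here is a minimal sketch of how such a three-state MDP could be written down in code. The dictionary layout, and the transition probabilities for everything except the A1-in-S1 example, are my own illustrative assumptions, not taken from the lecture's graph.

```python
# A minimal sketch of a three-state MDP: states S1..S3, actions A1 and A2,
# a stochastic transition model P(s' | s, a), and a reward attached to each
# state. Only the (S1, A1) distribution comes from the lecture; the rest is
# made up for illustration.
states = ["S1", "S2", "S3"]
actions = ["A1", "A2"]

# T[(s, a)] maps each successor state s' to P(s' | s, a).
# E.g. action A1 in S1 leads to S2 or S3 with 50% probability each.
T = {
    ("S1", "A1"): {"S2": 0.5, "S3": 0.5},
    ("S1", "A2"): {"S1": 1.0},
    ("S2", "A1"): {"S3": 1.0},
    ("S2", "A2"): {"S1": 1.0},
    ("S3", "A1"): {"S3": 1.0},
    ("S3", "A2"): {"S3": 1.0},
}

# Reward function over states, as in the lecture's example.
R = {"S1": 0.0, "S2": 10.0, "S3": 100.0}

# Sanity check: every transition distribution sums to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

The point of the layout is that everything a planner needs is in `T` and `R`: given those two tables, the objective below is fully specified.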
[Narrator] Before diving into too much detail, let me explain to you why MDPs really matter. What you see here is a robotic tour guide that the University of Bonn, with my assistance, deployed in the Deutsches Museum in Bonn, and the objective of this robot was to navigate the museum and guide visitors, mostly kids, from exhibit to exhibit. This is a challenging planning problem because as the robot moves, it can't really predict its action outcomes because of the randomness of the environment and the carpet and the wheels of the robot. The robot is not able to really follow its own commands very well, and it has to take this into consideration during the planning process so that when it finds itself in a location it didn't expect, it knows what to do. In the second video here, you see a successor robot that was deployed in the Smithsonian National Museum of American History in the late 1990s, where it guided many, many thousands of kids through the entrance hall of the museum, and once again, this is a challenging planning problem. As you can see, people are often in the way of the robot, and the robot has to take detours. Now, this one is particularly difficult because there were obstacles that were invisible, like a downward staircase. So this is also a challenging localization problem, trying to find out where you are, but that's for a later class. In the video here, you see a robot being deployed in a nursing home with the objective to assist elderly people by guiding them around, bringing them to appointments, reminding them to take their medication, and interacting with them, and this robot has been active and in use for many, many years, and, again, it's a very challenging planning problem to navigate through this elderly home. And the final robot I'm showing you here was built with my colleague Red Whittaker at Carnegie Mellon University. The objective here was to explore abandoned mines. Pennsylvania and West Virginia and other states are heavily mined. 
There are many abandoned old coal mines, and for many of these mines, it's unknown what the conditions are and where exactly they are. They're not really human accessible. They tend to have roof falls and very low oxygen levels. So we made a robot that went inside and built maps of those mines. All these problems have in common that they pose really challenging planning problems. The environments are stochastic. That is, the outcomes of actions are uncertain, and the robot has to be able to react to all kinds of situations, even ones it didn't plan for.
[Narrator] Let me give a much simpler example, often called grid world, for MDPs, and I'll be using this insanely simple example over here throughout this class. Let's assume we have a starting state over here, and there are 2 goal states, which are often called absorbing states, with very different rewards or payouts: plus 100 for the state over here, minus 100 for the state over here. Our agent is able to move about the environment, and when it reaches one of those 2 states, the game is over and the task is done. Obviously the top state is much more attractive than the bottom state with minus 100. Now to turn this into an MDP, let's assume actions are somewhat stochastic. So suppose we are in a grid cell and we attempt to go north. A deterministic agent would always succeed in moving to the north square if it's available, but let's assume that we only have an 80% chance to make it to the cell in the north. If there's no cell at all--there's a wall, like over here--we assume with 80% chance we just bounce back to the same cell. With 10% chance, we instead go left, and with another 10% chance, we go right. So if an agent is over here and wishes to go north, then with 80% chance it finds itself over here. If it goes north from here, because there's no north cell, it'll bounce back with 80% probability. In a cell like this one over here, it'll bounce back with 90% probability, but it still has a 10% chance of going right. This is a stochastic state transition, which we can equally define for the actions south, west, and east, and now we can see that in a situation like this, conventional planning is insufficient. For example, if we plan a sequence of actions starting over here, we might go north, north, east, east, east to reach our plus 100 absorbing or final state, but with this state transition model over here, even on the first step it might happen with 10% chance that we find ourselves over here, in which case conventional planning would not give us an answer. 
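The 80/10/10 motion model just described can be sketched in a few lines. The coordinate convention (rows A-C as 0-2, columns 1-4 as 0-3), the function names, and which sideways deviation counts as "left" are my own assumptions for illustration.

```python
ROWS, COLS = 3, 4  # rows A, B, C; columns 1..4

# Direction vectors: (row delta, col delta).
MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
# Perpendicular deviations for each commanded action.
SIDEWAYS = {"north": ("west", "east"), "south": ("east", "west"),
            "west": ("south", "north"), "east": ("north", "south")}

def step(cell, direction):
    """Move one cell; bounce back to the same cell at a wall."""
    r, c = cell
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return (nr, nc)
    return cell  # bounced off the wall

def transition(cell, action):
    """Return {successor cell: probability} under the 80/10/10 model."""
    left, right = SIDEWAYS[action]
    out = {}
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = step(cell, direction)
        out[nxt] = out.get(nxt, 0.0) + p
    return out

# From the bottom-right corner, commanding south: there is no cell to the
# south and none to the east, so we stay put with probability 0.9 and
# reach the cell to the west with probability 0.1.
print(transition((2, 3), "south"))
```

This reproduces the 90%-bounce-back case mentioned above: two of the three outcome directions hit walls, so their probability mass collapses onto the same cell.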
So we wish to have a planning method that provides an answer no matter where we are, and that's called a policy. A policy assigns an action to every state. So for example, a policy might look as follows: for these states, we wish to go north, north, east, east, east, but for this state over here, we wish to go north, maybe east over here, and maybe west over here. So for each state, except for the absorbing states, we have to define an action to define a policy. The planning problem thus becomes one of finding the optimal policy.
[Narrator] To understand the beauty of a policy, let me look into stochastic environments, and let me try to apply conventional planning. Consider the same grid I just gave you, and let's assume there's a discrete start state, the one over here, and we wish to find an action sequence that leads us to the goal state over here. In conventional planning we would create a tree. In C1, we're given 4 action choices: north, south, west, and east. However, the outcome of those choices is not deterministic. So rather than having a single outcome, nature will choose for us the actual outcome. In the case of going north, for example, we may find ourselves in B1, or back in C1. Similarly, for going south, we might find ourselves in C1, or back in C2, and so on. This tree has a number of problems. The first problem is the branching factor. While we have 4 different action choices, nature will give us up to 3 different outcomes, which makes up to 12 different things we have to follow. Now in conventional planning we might have to follow just 1 of those, but here we might have to follow up to 3 of them. So every time we plan a step ahead, we might increase the breadth of the search tree by at least a factor of 3. So one of the problems is that the branching factor is too large.
[Narrator] To understand the branching factor, let me quiz you on how many states you can possibly reach from any other state. As an example, from C1 you can reach, under any action choice, B1, C1, and C2, which gives you an effective branching factor of 3. So when I ask you, what's the effective branching factor in B3, what is the maximum number of states you can reach under any possible action from B3? So how many states can we reach from B3 over here?
[Narrator] And the answer is 8. If you go north, you might reach this state over here, this one over here, this one over here. If you go east, you might reach this state over here, or this one over here, or this one over here, or this one over here. When you put it all together, you can reach all of those 8 states over here.
[Narrator] There are other problems with the search paradigm. The second one is that the tree could be very deep, and the reason is we might be able to circle forever in the area over here without reaching the goal state, and that makes for a very deep tree, and until we reach the goal state, we won't even know it's the best possible action. So conventional planning might have difficulties with basically infinite loops. The third problem is that many states recur in the search. In A*, we were careful to visit each state only once, but here, because the actions might carry you back to the same state--C1 is, for example, over here and over here--you might find that many states in the tree are visited many, many different times. Now once you reach a state, it doesn't really matter how you got there. Yet the tree doesn't understand this, and it might expand states more than once. These are the 3 problems that are overcome by our policy method, and this motivates in part why calculating policies is so much better an idea than using conventional planning in stochastic environments. So let's get back to the policy case.
[Narrator] Let's look at the grid world again, and let me ask you a question. I wish to find an optimal policy for all these states that with maximum probability leads me to the absorbing state plus 100, and as I just discussed, I assume there are 4 different actions--north, south, west, and east--that succeed with probability 80%, provided that the corresponding grid cell is actually attainable. I wish to know what is the optimal action in the corner state over here, A1, and I give you 4 choices: north, south, west, and east.
[Narrator] And the answer is east. East in expectation transfers you to the right side, and you're one step closer to your goal position.
[Narrator] Let me ask the same question for the state over here, C1, which one is the optimal action for C1?
[Narrator] And the answer is north. It gets you one step closer. There are 2 equally long paths, but over here you risk falling into the minus 100; therefore, you'd rather go north.
[Narrator] The next question is challenging. Consider state C4. Which one is the optimal action, provided that you can run around as long as you want? There are no costs associated with steps, but you wish to maximize the probability of ending up in the plus 100 over here. Think before you answer this question.
[Narrator] And the answer is south. The reason why it's south is that if we attempt to go south, with an 80% probability we'll stay in the same cell--in fact, a 90% probability, because we can't go south and we can't go east. With a 10% probability, we find ourselves over here, which is a relatively safe state, because from there we can actually go to the left side. If we were to just go west, which is the intuitive answer, then there's a 10% chance we end up in the minus 100 absorbing state. You can convince yourself that if you go south, eventually find yourself in state C3, and then go west, west, north, north, east, east, east, you will never ever run the risk of falling into the minus 100. That argument is tricky, and to convince ourselves, let me ask the other hard question: what is the optimal action in state B3?
[Narrator] And the answer is west. If we're over here and we go east, we'd likely end up in the minus 100. If we go north, which seems to be the intuitive answer, there's a 10% chance we fall into the minus 100. However, if we go west, then there's absolutely no chance we fall into the minus 100. We might find ourselves over here, we might stay in the same state, or we might find ourselves over here, but from these states over here, there are safe policies that avoid the minus 100 entirely.
[Narrator] So even for the simple grid world, the optimal control policy, assuming stochastic actions and no cost of moving except for the final absorbing rewards, is somewhat nontrivial. Take a second to look at this. Along here it seems pretty obvious, but for the state over here, B3, and for the state over here, C4, we choose an action that just avoids falling into the minus 100, which is more important than trying to make progress towards the plus 100. Now, obviously this is not the general case of an MDP, and it's somewhat frustrating that we'd be willing to run into the wall just so as to avoid falling into the minus 100. The reason why this seems unintuitive is that we're ignoring the issue of costs. In normal life, there is a cost associated with moving. MDPs are general enough to have a cost factor, and the way we're going to denote costs is by defining our reward function over every possible state. Reaching the state A4 gives us plus 100, minus 100 for B4, and perhaps minus 3 for every other state, which reflects the fact that if we take a step somewhere, we pay minus 3. So this gives an incentive to shorten the final action sequence. We're now ready to state the actual objective of an MDP, which is to maximize not just the momentary reward, but the sum of all future rewards. We're going to write R_t to denote the reward received at time t, and because the reward itself is stochastic, we have to compute the expectation over those, and that is what we seek to maximize. So we seek to find the policy that maximizes the expression over here. Now another interesting caveat is that sometimes people put a so-called discount factor into this equation, with an exponent of t, where the discount factor might be 0.9, and what this does is decay future rewards relative to more immediate rewards. It's kind of an alternative way to specify costs. 
So we can make this explicit by a negative reward per state, or we can bring in a discount factor that discounts the plus 100 by the number of steps that went by before we reached the plus 100. This also gives an incentive to get to the goal as fast as possible. The nice mathematical thing about a discount factor is that it keeps the expectation bounded. It is easy to show that this expression over here will always be smaller than or equal to 1 over 1 minus gamma, times the maximum absolute reward, which in this case would be plus 100.
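In symbols, the objective and the bound just mentioned read as follows; this is the standard geometric-series argument, with $R_{\max}$ denoting the largest absolute reward:

```latex
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\;
  E\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, \pi \right],
\qquad
\left|\, E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right] \right|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}
```

With $\gamma = 0.9$ and $R_{\max} = 100$, the bound evaluates to $100 / (1 - 0.9) = 1000$.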
The definition of the expected sum of future, possibly discounted, rewards that I just gave you allows me to define a value function. For each state S, the value of the state is the expected sum of future discounted rewards, provided that I start in state S and then execute policy pi. This expression looks really complex, but it means something really simple: suppose we start in the state over here, and we get +100 over here, -100 over here. And suppose, for now, every other state costs us -3. For any possible policy that assigns actions to the non-absorbing states, we can now simulate the agent for quite a while and compute empirically the average reward that is received until we finally hit a goal state. For example, for a policy that you like, the value of any state would, of course, depend on how much you make progress towards the goal, or whether you bounce back and forth. In fact, in this state over here, you might bounce down and have to do the loop again. But there's a well-defined expectation over all possible executions of the policy pi that is specific to each state and each policy pi. That's called the value. And value functions are absolutely essential to MDPs. The way we're going to plan is to iterate and compute value functions, and it will turn out that by doing this, we find better and better policies as well.
Before I dive into mathematical detail about value functions, let me just give you an intuition. The value function is a potential function that leads from the goal location--in this case, the 100 in the upper right--all the way into the space, so that hill climbing in this potential function leads you on the shortest path to the goal. The algorithm is a recursive algorithm. It spreads value through the space, as you can see in this animation, and after a number of iterations it converges, and you have a grayscale value that really corresponds to the best way of getting to the goal. Hill climbing in that function gets you to the goal. To simplify, think about this as pouring a glass of milk onto the +100 state and having the milk spread through the maze; later on, when you follow the gradient of the milk flow, you will reach the goal in the optimal possible way.
Let me tell you about a truly magical algorithm called value iteration. In value iteration, we recursively calculate the value function so that in the end we get what's called the optimal value function, and from that we can derive, by lookup, the optimal policy. Here's how it goes. Suppose we start with a value function of 0 everywhere except for the 2 absorbing states, whose values are +100 and -100. Then we can ask ourselves the question: is 0, for example, a good value for the field A3? And the answer is no, it isn't. It is somewhat inconsistent; we can compute a better value. In particular, we can understand that if we're in A3 and we choose to go east, then with 0.8 chance we should expect a value of 100. With 0.1 chance we'll stay in the same state, and with 0.1 chance we'll end up down here, both of which still have a value of 0, and either way we pay the step cost of -3. With the appropriate definition of value, we get 0.8 × 100 - 3 = 77. So 77 is a better estimate of the value for the state over here. And now that we've done this, we can ask ourselves whether this is a good value, or this one, or this one. We can propagate value backwards, in reverse order of action execution, from the positive absorbing state through this grid world, and fill every single state with a better value estimate than the one we assumed initially. If we do this for the grid over here and run value iteration to convergence, then we get the following value function. We get 93 over here; we're very close to the goal. This state will be worth 68, and this state is worth 47, and the reason these are not so good is that we might stay quite a while in them before we're able to execute an action that gets us out. Let me give you an algorithm that defines value iteration. We wish to estimate recursively the value of state S, and we do this based on possible successor states S prime that we look up in the existing table. 
Now, actions A are non-deterministic. Therefore, we have to go through all possible successor states S prime and weigh each outcome with the associated probability: the probability of reaching S prime given that we start in state S and apply action A. This expression is usually discounted by gamma, and we also add the reward, or cost, of the state. And because there are multiple actions and it's up to us to choose the right one, we maximize over all possible actions. Written out, the update is V(s) <- R(s) + gamma * max over a of [ sum over s' of P(s' | s, a) * V(s') ]. See, we look at this equation and it looks really complicated, but it's actually really simple: we compute a value recursively based on successor values, plus the reward, and minus the cost it takes to get there. Because Mother Nature picks the successor state for us for any given action, we compute an expectation over the value of the successor state, weighted by the corresponding probabilities, which is happening over here. And because we can choose our action, we maximize over all possible actions--hence the max, as opposed to an expectation, over the actions. This equation is called a backup. In terminal states, we just assign V(s) = R(s). Obviously, in the beginning of value iteration the two sides of this assignment differ, and we have to update. But as Bellman showed a while ago, this process of updates converges. After convergence, the assignment over here is replaced by an equality sign, and when this equality holds true, we have what is called the Bellman equality, or Bellman equation. And that's all there is to know to compute values. If you apply this equation over and over again to each state, eventually you get a value function that looks just like this, where the value really corresponds to the optimal future reward-cost trade-off you can achieve if you act optimally in any given state.
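Putting the backup equation to work, here is a compact sketch of value iteration on the grid world. The coordinate scheme (rows A-C as 0-2, columns 1-4 as 0-3), the terminal placement (+100 at A4, -100 at B4), step reward -3, and gamma = 1 are my assumptions following the lecture's running example; the 80/10/10 bounce-back motion model is the one described earlier.

```python
ROWS, COLS = 3, 4
TERMINALS = {(0, 3): 100.0, (1, 3): -100.0}   # A4 = +100, B4 = -100
STEP_REWARD, GAMMA = -3.0, 1.0
MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
SIDEWAYS = {"north": ("west", "east"), "south": ("east", "west"),
            "west": ("south", "north"), "east": ("north", "south")}

def successors(cell, action):
    """(probability, successor) pairs under the 80/10/10 bounce-back model."""
    left, right = SIDEWAYS[action]
    out = []
    for d, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        r, c = cell[0] + MOVES[d][0], cell[1] + MOVES[d][1]
        out.append((p, (r, c) if 0 <= r < ROWS and 0 <= c < COLS else cell))
    return out

def backup(V, cell):
    """One Bellman backup: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
    if cell in TERMINALS:
        return TERMINALS[cell]
    return STEP_REWARD + GAMMA * max(
        sum(p * V[s2] for p, s2 in successors(cell, a)) for a in MOVES)

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for cell, val in TERMINALS.items():
    V[cell] = val

# A single backup of A3 from the all-zero start recovers the lecture's 77:
# 0.8 * 100 - 3 = 77.
a3_first = backup(V, (0, 2))
print(a3_first)  # 77.0

# Iterate (synchronous sweeps) to convergence.
for _ in range(1000):
    V = {cell: backup(V, cell) for cell in V}
```

After convergence, the cells near the +100 carry the highest values, and the values decay with distance from the goal, exactly the potential-function picture sketched earlier.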
Let me take my example world and apply value iteration in a quiz. As before, assume the value is initialized as +100 and -100 for the absorbing states and 0 everywhere else. And let me make the assumption that our transition probability is deterministic. That is, if I execute the east action from this state over here, with probability 1 I end up over here; if I execute the north action over here, with probability 1 I will find myself in the same state as before. There is no uncertainty anymore. That's really important for now, just for this one quiz. I'll also assume gamma equals 1, just to make things a little bit simpler. And the cost over here is -3, unless you reach an absorbing state. What I'd like to know is: after a single backup, what's the value of A3?
And the answer is 97. It's easy to see that the action east is the value-maximizing action. Let's plug in east over here. Gamma equals 1, and there is a single successor state with value 100, so we have 100 minus the 3 it costs us to get there, which gives 97.
Let's run value iteration again, and let me ask what's the value for B3, assuming that we already updated the value for A3 as shown over here.
And again, making use of the observation that our state transition function is deterministic, we get 94, and the logic is the same as before. The optimal action here is going north, which will succeed with probability 1. Therefore, we can use the value recursively from A3 to propagate back to B3.
And finally, I would like to know what's the value of C1, the cell down here, after we run value iteration over and over again, all the way to convergence. Again, gamma equals 1, and the state transition function is deterministic.
And the answer is easily obtained if you just subtract 3 for each step. We get 88 and 85 over here. We could also reach the same value going around here. So 85 is the right answer, and this will be the value function after convergence. It's beautiful to see that the value function is effectively 100 minus 3 times the distance to the positive absorbing state. So we have 97, 94, 91, 88, 85 and so on. This is a degenerate case: with a deterministic state transition function it's easy, but it gets more tricky to calculate in the stochastic case.
Let me ask the same question for the stochastic case. We have the same world as before, and actions have stochastic outcomes. With probability 0.8, we get the action we commanded; otherwise we go left or right. And assuming that the initial values are all 0, calculate for me, for a single backup, the value of A3.
This should look familiar from the previous material. It's 77, and the reason is that in A3 we have an 80% chance, for the action going east, to reach the 100. For the remaining 20%, we either stay in place or go to the field down here, both of which have an initial value of 0 and contribute nothing. So we get 80, minus the cost of 3, and that gives us 80 - 3 = 77. It's also easy to verify that any of the other actions have lower values. For example, going west only reaches cells whose value is still 0, so the value of going west would right now be estimated as -3, and 77 is larger than -3. Therefore, we'll pick 77, the value of the action that maximizes the update equation over here.
And here's a somewhat non-trivial quiz. For the state B3, calculate the value, assuming that we have the value function shown over here, and all the open states have an assumed value of 0, because we're still at the beginning of our value update. What would be our very first value estimate for B3, computed based on the values shown over here?
And the answer is 48.6. Obviously, it's not quite as trivial as the calculation before, because there are 2 competing actions. We can try to go north, which gives us the 77 but risks the chance of falling into the -100, or we can go west, as before, which gives us a much smaller chance of reaching the 77 but avoids the -100. Let's do both and see which one is better. If we go north, we have a 0.8 chance of reaching the 77, a 10% chance of paying -100, and a 10% chance of staying in the same location, which at this point still has a value of 0. We subtract our cost of 3, and we get 61.6 - 10 - 3 = 48.6. Let's check the west action value. We reach the 77 with probability 0.1; with 0.8 chance we stay in the same cell, which has a value of 0; and with 0.1 chance we end up down here, which also has a value of 0. We subtract our cost of 3, and that gives us 7.7 - 3 = 4.7. At this point, going west is vastly inferior to going north, and the reason is that we've already propagated the great value of 77 to this cell over here, whereas this one is still set to 0. So we will set B3 to 48.6.
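The two candidate backups are easy to check numerically. The variable names below are mine; the probabilities and values are the ones from the quiz.

```python
# Backup for B3 with V(A3) = 77, V(B4) = -100, everything else still 0.
# Going north: 0.8 to A3, 0.1 to B4, 0.1 stay in B3; step cost 3.
north = 0.8 * 77 + 0.1 * (-100) + 0.1 * 0 - 3
# Going west: 0.1 deviates up to A3, 0.8 to B2, 0.1 down to C3; step cost 3.
west = 0.1 * 77 + 0.8 * 0 + 0.1 * 0 - 3

# north comes out around 48.6, west around 4.7; the backup takes the max.
print(north, west)
```

The max over the two (the other two actions are even worse here) is what the update equation assigns to B3.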
So, now that we have the value backup function that we discussed in depth, the question becomes: what's the optimal policy? And it turns out this value backup function tells us directly which action to pick: it's just the action that maximizes this expression over here. For any state S and any value function V, we can define a policy as the one that picks the action, under the argmax, that maximizes the expression over here. For the maximization, we can safely drop gamma and R(s), since they don't depend on the action. Baked into the value iteration function was already an action choice that picks the best action; we've just made it explicit. This is the way of backing up values, and once values have been backed up, this is the way to find the optimal thing to do.
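As a sketch, this argmax policy extraction can be written generically. The corridor example below (four states whose value grows to the right, with noisy left/right actions) is an invented toy model just to exercise the function; none of its numbers come from the lecture.

```python
def extract_policy(V, states, actions, successors):
    """pi(s) = argmax_a sum_s' P(s'|s,a) * V(s').
    Gamma and R(s) can be dropped from the argmax: they don't depend on a."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2] for p, s2 in successors(s, a)))
            for s in states}

# Toy corridor: states 0..3, value grows to the right; each action moves
# one step in its direction with probability 0.9 and stays put with 0.1.
def corridor(s, a):
    nxt = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return [(0.9, nxt), (0.1, s)]

V = {0: 0.0, 1: 10.0, 2: 20.0, 3: 100.0}
pi = extract_policy(V, [0, 1, 2], ["left", "right"], corridor)
print(pi)  # every non-terminal state heads right, toward higher value
```

The same function would read a grid-world policy off a converged value function: hill climbing in V, exactly as in the milk analogy.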
I'd like to show you some value functions after convergence and the corresponding policies. If we assume gamma = 1 and our cost for the non-absorbing states equals -3, as before, we get the following approximate value function after convergence, and the corresponding policy looks as follows. Up here we go right until we hit the absorbing state. Over here we prefer to go north. Here we go left, and here we go north again. I left the policy open for the absorbing states because there's no action to be chosen there. This is a situation where the risk of falling into the -100 is balanced against the time spent going around. We have an action over here, in this very state here, that risks a 10% chance of falling into the -100, but under the cost model of -3 that's preferable to the action of going south. Now, this all changes if we assume a cost of 0 for all the states over here, in which case the value function after convergence looks interesting, and with some thought you realize it's exactly the right one. Each value is exactly 100, and the reason is that with a cost of 0, it doesn't matter how long we move around. Eventually we can guarantee, in this case, that we reach the 100; therefore, each value after backups becomes 100. The corresponding policy is the one we discussed before, and the crucial thing here is that for this state we go south, if we're willing to wait the time, and for this state over here we go west, willing to wait the time, so as to avoid falling into the -100. All the other states resolve exactly as you would expect them to resolve, as shown over here. If we set the cost to -200, so each step is even more expensive than falling into this ditch over here, we get a value function that's strongly negative everywhere, with this being the most negative state. But more interesting is the policy. This is a situation where our agent tries to end the game as fast as possible, so as not to endure the penalty of -200. 
And even over here, where it throws itself into the -100, that's still better than going north, enduring the excess penalty of -200, and then reaching the +100. Similarly, over here we go straight north, and over here we go as fast as possible to the state over here. Now, this is an extreme case. I don't know when it would make sense to set a penalty for life that is so negative that even death is preferable to living, but certainly that's the result of running value iteration in this extreme case.
So, we've learned quite a bit so far. We've learned about Markov decision processes: fully observable, with a set of states and corresponding actions that have stochastic effects, characterized by a conditional probability P(s' | a, s) of reaching s' if we apply action a in state s. We seek to maximize a reward function that we defined over states; it can equally be defined over state-action pairs. The objective was to maximize the expected future accumulated and discounted reward, as shown by this formula over here. The key to solving MDPs was value iteration, where we assign a value to each state. There are alternative techniques that assign values to state-action pairs, often called Q(s, a), but we didn't really consider those so far. We defined a recursive update rule to update V(s), which was very logical once we understood that we have an action choice, but nature chooses for us the outcome of the action, via the stochastic transition probability over here. And then we observed that value iteration converges, and we were able to define a policy by taking the argmax in the value iteration expression, which I don't spell out over here. This is a beautiful framework. It's really different from the planning we saw before because of the stochasticity of the action effects. Rather than making a single sequence of states and actions, as would be the case in deterministic planning, we now compute an entire field, a so-called policy, that assigns an action to every possible state, and we compute it using a technique called value iteration that spreads value in reverse order through the field of states.
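The Q(s, a) quantity mentioned in passing relates to V by Q(s, a) = R(s) + gamma * sum over s' of P(s'|s,a) V(s'), with V(s) = max over a of Q(s, a). Here is a tiny sketch of that relation, with the "go east from A3" numbers from earlier plugged in as a check; the cell names and the successor helper are illustrative assumptions.

```python
def q_value(s, a, V, R, successors, gamma=1.0):
    """Q(s, a) = R(s) + gamma * sum_s' P(s'|s,a) * V(s').
    The optimal value function is then V(s) = max_a Q(s, a)."""
    return R(s) + gamma * sum(p * V[s2] for p, s2 in successors(s, a))

# Check against the earlier example: from A3, going east reaches the +100
# with probability 0.8 and zero-valued cells otherwise, with step reward -3.
V = {"A3": 0.0, "A4": 100.0, "B3": 0.0}
succ = lambda s, a: [(0.8, "A4"), (0.1, "A3"), (0.1, "B3")]  # east from A3
print(q_value("A3", "east", V, lambda s: -3.0, succ))  # 77.0
```

Reinforcement learning methods like Q-learning work with exactly this quantity, which is why the summary mentions it as an alternative to V(s).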
So far, we talked about the fully observable case, and I'd like to get back to the more general case of partial observability. Now, to warn you: I don't think it's worthwhile in this class to go into full depth on the techniques used for planning under uncertainty when the world is partially observable. But I'd like to give you a good flavor of what it really means to plan in information spaces, and to reflect the types of methods that are being brought to bear on planning under uncertainty. As in my Stanford class, I won't go into detail here either, because the details are better suited to more specialized classes, but I hope you enjoy the flavor of the material you're going to see in the next couple of minutes.
So we've now learned about fully observable environments and planning in stochastic environments with MDPs. I'd like to say a few words about partially observable environments, or POMDPs, which I won't go into in depth; the material is relatively complex. But I'd like to give you a feeling for why this is important, and what type of problems you can solve with this that you could never possibly solve with MDPs. So, for example, POMDPs address problems of optimal exploration versus exploitation, where some of the actions might be information-gathering actions, whereas others might be goal-driven actions. That's not really possible in MDPs, because the state space is fully observable and therefore there is no notion of information gathering.
I'd like to illustrate the problem using a very simple environment that looks as follows: Suppose you live in a world like this, and your agent starts over here, and there are 2 possible outcomes. You can exit the maze over here, where you get a plus 100, or you can exit the maze over here, where you receive a minus 100. Now, in a fully observable and deterministic case, the optimal plan might look something like this; whether or not it goes straight over here depends on the details--for example, whether the agent has momentum or not. But you'll find a single sequence of actions and states that might cut the corners as close as possible to reach the plus 100 as fast as possible. That's conventional planning. Let's contrast this with the case we just learned about, which is the fully observable but stochastic case. We just learned that the best thing to compute is a policy that assigns to every possible state an optimal action; simply speaking, this might look as follows, where each of these arrows corresponds to a sample control policy, and those are defined even in parts of the state space that are far away. So this would be an example of a control policy where all the arrows gradually point you over here. We just learned about this, using MDPs and value iteration. The case I really want to get at is the case of partial observability, which we will eventually solve using a technique called POMDP. And in this case, I'm going to keep the location of the agent in the maze observable. The part I'm going to make unobservable is where, exactly, I receive plus 100 and where I receive minus 100. Instead, I'm going to put a sign over here that tells the agent where to expect plus 100 and where to expect minus 100. So the optimal policy would be to first move to the sign, read the sign, and then return and go to the corresponding exit, for which the agent now knows where to receive plus 100.
So, for example, if this exit over here gives us plus 100, the sign will say Left. If this exit over here gives us plus 100, the sign will say Right. What makes this environment interesting is that if the agent knew which exit had the plus 100, it would go north from its starting position. It goes south exclusively to gather information. So the question becomes: Can we devise a method for planning that understands that, even though we wish to reach the plus 100 at the best exit, a detour is necessary to gather information? So here's a solution that doesn't work: Obviously, the agent might be in 2 different worlds--and it doesn't know which. It might be in the world where there's plus 100 on the Left side, or it might be in the world with plus 100 on the Right side, with minus 100 at the corresponding other exit. What doesn't work is to solve the problem for each of these cases separately and then put the solutions together--for example, by averaging. The reason this doesn't work is that the agent, after averaging, would go north. It would never have the idea that it is worthwhile to go south, read the sign, and then return to the optimal exit. When it finally arrives at the intersection over here, it doesn't really know what to do. So here is the approach that does work--and it's related to information space, or belief space. In the information space or belief space representation, you do planning not in the set of physical world states, but in what you might know about those states. And if you're really honest, you find out that there's a multitude of belief states. Here's the initial one, where you just don't know where to receive the 100. Now, if you move around and either reach one of these exits or the sign, you will suddenly know where to receive the 100--and that makes your belief state change, as shown by each of these transitions.
So, for example, if you find out that the 100 is Left, then your belief state will look like this, where the ambiguity is now resolved. Now, how do you jump from this belief state to that belief state? The answer is: when you read the sign, there's a 50 percent chance that the location over here will result in a transition to the location over here. There's also a 50 percent chance that the plus 100 is on the Right, so the transition over here is stochastic, and with 50 percent chance it will result in a transition over here. If we now do the MDP trick in this new belief space, and you pour water in here, it kind of flows through here and creates all these gradients, as we had before. We do the same over here, and all the gradients being created point to this exit on the Left side. Then, eventually, this water will flow through here and create gradients like this, and then flow back through here, where it creates gradients like this. So the value function is plus 100 over here and plus 100 over here, gradually decreasing down here and down here, then gradually decreasing further over here--and even further over there, so we've got arrows like these. And that shows you that in this new belief space, you can find a solution. In fact, you can use value iteration--the MDP's value iteration--in this new space to find a solution to this really complicated partially observable planning problem. And the solution, just to reiterate, will suggest: Go south first, read the sign, expose yourself to the random transition into the Left or the Right world, in which you are now able to reach the plus 100 with absolute confidence.
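The belief-space trick can be sketched in a few lines of code: augment each physical position with what the agent believes, make reading the sign a 50/50 stochastic transition in belief space, and run ordinary value iteration there. The graph layout, the -1 step cost, and all the names below are illustrative assumptions, not the exact lecture diagram:

```python
# Value iteration in belief space for a simplified sign-maze (illustrative).
GAMMA = 1.0
STEP = -1.0  # small cost per move, so unnecessary detours are penalized

# Physical positions and which ones you can move between.
EDGES = {
    'sign': ['start'],
    'start': ['sign', 'junction'],
    'junction': ['start', 'exit_left', 'exit_right'],
}
EXITS = ('exit_left', 'exit_right')
BELIEFS = ('unknown', 'L', 'R')   # where the agent believes the +100 is


def exit_reward(pos, belief):
    """Expected terminal reward of an exit under the current belief."""
    if belief == 'unknown':
        return 0.0                        # 50/50 between +100 and -100
    good = 'exit_left' if belief == 'L' else 'exit_right'
    return 100.0 if pos == good else -100.0


def successors(belief, nxt):
    """Belief-state outcomes of moving to nxt: list of (state, prob).
    Reading the sign resolves an unknown belief, 50/50."""
    if nxt == 'sign' and belief == 'unknown':
        return [(('sign', 'L'), 0.5), (('sign', 'R'), 0.5)]
    return [((nxt, belief), 1.0)]


def solve(eps=1e-9):
    states = [(p, b) for p in list(EDGES) + list(EXITS) for b in BELIEFS]
    V = {s: 0.0 for s in states}
    for s in states:
        if s[0] in EXITS:
            V[s] = exit_reward(*s)        # terminal values are fixed
    while True:
        delta = 0.0
        for (p, b) in states:
            if p in EXITS:
                continue
            best = max(STEP + GAMMA * sum(pr * V[s2]
                                          for s2, pr in successors(b, nxt))
                       for nxt in EDGES[p])
            delta = max(delta, abs(best - V[(p, b)]))
            V[(p, b)] = best
        if delta < eps:
            return V
```

With these numbers, the converged values make going south to the sign strictly better than heading straight to the junction while the belief is still unknown, which is exactly the "detour to gather information" behavior that averaging the two fully-known solutions could never produce.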
So now we have learned pretty much all there is to know about planning under uncertainty. We talked about Markov Decision Processes. We explained the concept of information spaces; and what's better, you can actually apply it. You can apply it to a huge number of problems where the outcomes of actions are uncertain. There is a huge literature on robot motion planning. Here are some examples of robots moving through real environments that use MDP-style planning techniques; these methods have become vastly popular in artificial intelligence--so I'm really glad you now understand the basics and can apply them yourself.