Here is a sequence of MDP questions. We're given a maze environment with 8 fields, where we receive +100 over here and -100 in the corner over here. Our agent can go north, south, west, or east, but actions may fail at random. With probability P, where P is a number between 0 and 1, the action succeeds, and with 1 - P we go into reverse. For example, if we take the action go east into this state over here, with probability P we find ourselves over there; with 1 - P we find ourselves right over here, in the exact opposite direction. Here is the east action again: with P we go to the right, and with 1 - P we go to the left. Of course, if we bounce into a wall, we stay where we are. For my first question, I'll assume P equals 1. There is no uncertainty in the action outcome, and there is no failure; the state transition function is deterministic. I want you to fill in for each state the final value after running value iteration to completion, and please assume the cost is -4 and we use gamma equals 1 as the discount factor. Please fill in those missing six values over here.
And the answer is obtained by taking the neighboring value closest to +100 and subtracting the step cost of 4, which gives us 96 over here, then 92, 88, and 84.
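Since each step simply costs 4 in the deterministic case, these values can be sketched in a few lines of Python. Note that the grid layout used here is an assumption reconstructed from the description, not something stated explicitly in the lecture: a top row a1 through a4 and a bottom row b1 through b4, with -100 at b1 and +100 at b4.

```python
# With P = 1 the converged value of each state is just 100 minus
# 4 per step along the shortest path to the +100 field.
# Assumed layout (reconstructed, not given verbatim): top row
# a1..a4, bottom row b1..b4 with b1 = -100 and b4 = +100.
steps_to_goal = {"a1": 4, "a2": 3, "a3": 2, "a4": 1, "b2": 2, "b3": 1}
values = {s: 100 - 4 * n for s, n in steps_to_goal.items()}
print(values)  # a4 and b3 come out to 96, a3 and b2 to 92, a2 to 88, a1 to 84
```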
Let us now assume that P = 0.8, which means actions fail with probability 0.2. Again, the cost is -4, and gamma equals 1. I want you to run exactly one value calculation for the state in red up here. Assuming that the value function is initialized with 0 everywhere, what will be the value after a single value iteration for the state up here? This is the state a4.
The answer is 76. The value over here is maximized by the south action, which succeeds with 0.8 probability; with 0.2 probability we stay in the same state, whose initial value was 0. We then subtract the action cost of 4, which gives 0.8 * 100 + 0.2 * 0 - 4 = 80 + 0 - 4 = 76.
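This single Bellman backup can be written out directly. The assumption here, taken from the setup, is that going south from a4 reaches the +100 field on success, and the failed (reversed) action bounces off the wall back into a4, whose value is still 0.

```python
# One Bellman backup for state a4, with V initialized to 0 everywhere
# and the goal field fixed at +100; P = 0.8, cost = -4, gamma = 1.
p, cost, gamma = 0.8, -4.0, 1.0
v_goal = 100.0  # south from a4 reaches the +100 field on success
v_a4 = 0.0      # the reversed action bounces back into a4, initial value 0
value = cost + gamma * (p * v_goal + (1 - p) * v_a4)
print(value)  # 76.0
```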
Now, using the same premise as before, P equals 0.8, cost equals -4, and gamma equals 1, I'd like you to run value iteration to completion. For the one state over here, a4, I'd like to know: what is its final value? Now, you might be tempted to write a piece of computer software, but for this specific state, it's actually possible to do it with a relatively simple piece of math. It's not trivial, but give it a try. What is the value of a4 after convergence?
The answer is 95. To see this, observe that at convergence, a new iteration no longer changes the value. We know that the optimal policy is to go south over here, so we just write out the value iteration update for going south. Let's call the value x. The update says x equals 0.8 times 100, plus 0.2 times x for staying in the same state, minus the cost of 4. This invariance must hold true after convergence, and we can now solve for x: moving the 0.2x to the left gives 0.8x equals 76, so x is 76 divided by 0.8, which is 95.
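The fixed-point equation from this argument can be checked both ways: solving it in closed form, and iterating the update until it stops changing.

```python
# The self-consistency equation at convergence is
#   x = -4 + 0.8 * 100 + 0.2 * x
# Closed form: 0.8x = 76, so x = 76 / 0.8 = 95.
x_closed = (0.8 * 100 - 4) / 0.8

# Iterating the same update from 0 converges to the same fixed point.
x = 0.0
for _ in range(200):
    x = -4 + 0.8 * 100 + 0.2 * x

print(x_closed, round(x, 6))  # 95.0 95.0
```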
Finally, I'd like to ask you: what is the optimal policy for the parameters we just studied? I'm listing here all six states--a1, a2, a3, a4, b2, and b3--and I'd like you to tell me whether you would like to go north, south, west, or east in each of those six states over here. For each of those there is exactly one correct answer.
It is easy to see that in these three states over here, a1 to a3, you want to go east. In the state a4 up here, you wish to go south. In b2 you wish to go north, so as not to risk the 0.2 probability of falling into -100. In b3 it's perfectly fine to go east: in the worst case you find yourself in b2, in which case you can safely escape the -100 by going north, turning right again, and going down over here. This is the correct set of answers over here.
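The whole exercise can be checked with a short value-iteration sketch. The grid layout and state names below are assumptions reconstructed from the description (top row a1 through a4, bottom row b1 through b4, with b1 = -100 and b4 = +100); the transition model is the one stated in the lecture: succeed with probability p, reverse with probability 1 - p, and bounce off walls.

```python
# Value iteration for the full maze with stochastic actions.
# Assumed layout: top row a1..a4, bottom row b1..b4,
# where b1 = -100 and b4 = +100 are terminal fields.
ROWS = ["a", "b"]
TERMINALS = {"b1": -100.0, "b4": 100.0}
STATES = [r + str(c) for r in ROWS for c in range(1, 5)]
MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
OPPOSITE = {"north": "south", "south": "north", "west": "east", "east": "west"}

def step(state, move):
    """Deterministic successor of `state` under `move`; walls bounce."""
    r, c = ROWS.index(state[0]), int(state[1])
    dr, dc = MOVES[move]
    nr, nc = r + dr, c + dc
    return ROWS[nr] + str(nc) if 0 <= nr < 2 and 1 <= nc <= 4 else state

def q(v, s, m, p, cost, gamma):
    """Expected value of move m in state s: succeed with p, reverse with 1 - p."""
    return cost + gamma * (p * v[step(s, m)] + (1 - p) * v[step(s, OPPOSITE[m])])

def value_iteration(p=0.8, cost=-4.0, gamma=1.0, sweeps=200):
    v = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(sweeps):
        v = {s: v[s] if s in TERMINALS
             else max(q(v, s, m, p, cost, gamma) for m in MOVES)
             for s in STATES}
    policy = {s: max(MOVES, key=lambda m: q(v, s, m, p, cost, gamma))
              for s in STATES if s not in TERMINALS}
    return v, policy

v, policy = value_iteration()
print(round(v["a4"], 2), policy)
```

Running it with p = 0.8 reproduces both the converged value of 95 for a4 and the policy above: east in a1 through a3, south in a4, north in b2, and east in b3.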