This project has three parts. The first two use a familiar Gridworld domain to train a Reinforcement Learning agent. The third part involves designing an agent that can successfully play Pacman! Please download the code template (
ml3_project.zip, under the Downloadables section here), in which most of the base Gridworld and Pacman GUI handlers are already designed. You have to do the following:
This project will be automatically verified through the Udacity autograder.
The Gridworld is designed such that you first must enter a pre-terminal state (the double-walled cells shown in the GUI) and then take the special 'exit' action before the episode actually ends (in a virtual terminal state called
TERMINAL_STATE, which is not shown in the GUI).
Figure 1. The Gridworld GUI
The Gridworld environment is created and managed using
gridworld.py. Try running it in these two modes:
python gridworld.py -h (for a full list of options)
python gridworld.py -m (for moving the agent in the gridworld manually using arrow keys)
When you move the agent manually, you will see that the agent follows your actions only 80% of the time. Such is the life of the Gridworld agent!
Look at the console output that accompanies the graphical output (or use
-t for all text). You will be told about each transition the agent experiences; this reporting can be turned off with a command-line option (see -h for the full list).
As in Pacman, positions are represented by
(x, y) Cartesian coordinates and any arrays are indexed by
[x][y], with 'North' being the direction of increasing
y, etc. By default, most transitions receive a reward of zero, though you can change this with the living reward option (-r).
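The indexing convention above can be sketched in a few lines. This is an illustrative snippet only, not part of the project code; the grid dimensions and the `north_of` helper are made up for the example.

```python
# Illustrative sketch of the (x, y) Cartesian convention: arrays are
# indexed [x][y], and 'North' is the direction of increasing y.
width, height = 4, 3
values = [[0.0 for y in range(height)] for x in range(width)]  # values[x][y]

def north_of(x, y):
    """Hypothetical helper: the cell one step North of (x, y)."""
    return (x, y + 1)

values[1][2] = 0.5       # the cell at x=1, y=2
print(north_of(1, 1))    # -> (1, 2): moving North increases y
```

Keeping this convention straight matters because it is the opposite of the usual row/column (row-major) indexing of 2D arrays.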
Write a value iteration agent in the class
ValueIterationAgent, which has been partially specified for you in
valueIterationAgents.py. Your value iteration agent is an offline planner, not a reinforcement agent, and so the relevant training option is the number of iterations of value iteration it should run (option
-i) in its initial planning phase.
ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.
Value iteration computes k-step estimates of the optimal values, V_k. In addition to running value iteration, implement the following methods for
ValueIterationAgent using V_k:
getValue(state) returns the value of a state.
getPolicy(state) returns the best action according to computed values.
getQValue(state, action) returns the q-value of the (state, action) pair.
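The batch value-iteration update these methods rely on can be sketched on a toy MDP. The two-state MDP below (its states, transitions, and rewards) is invented purely for illustration; the project's ValueIterationAgent instead works through the MDP interface it receives on construction.

```python
# Minimal self-contained sketch of batch value iteration:
#   V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V_k(s')]
import collections

# transitions[state][action] = list of (next_state, probability)
transitions = {
    'A': {'right': [('B', 1.0)]},
    'B': {'exit': [('TERMINAL', 1.0)]},
    'TERMINAL': {},                      # virtual terminal state
}
rewards = {('B', 'exit'): 1.0}           # every other transition rewards 0
gamma = 0.9

def value_iteration(iterations):
    V = collections.defaultdict(float)   # V_0 = 0 everywhere
    for _ in range(iterations):
        newV = collections.defaultdict(float)
        for s, actions in transitions.items():
            if not actions:              # terminal state keeps value 0
                continue
            newV[s] = max(
                sum(p * (rewards.get((s, a), 0.0) + gamma * V[s2])
                    for s2, p in outcomes)
                for a, outcomes in actions.items())
        V = newV                         # batch update: V_{k+1} from V_k
    return V

V = value_iteration(100)
print(V['B'])   # 1.0 (the exit reward, received immediately)
print(V['A'])   # 0.9 (one step away, discounted by gamma)
```

Note the batch flavor: each pass computes `newV` entirely from the previous iteration's `V`, rather than updating values in place.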
The following command loads your
ValueIterationAgent (selected using
the -a value command-line option), which runs value iteration for 100 iterations (
-i 100) and then executes the resulting policy for 10 episodes (-k 10). Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (
V(start)) and the empirical resulting average reward are quite close:
python gridworld.py -a value -i 100 -k 10
You can submit and verify your state values here. NOTE: If you run an episode manually (
-m), your total return may be less than expected, due to the discount rate (
0.9 by default).
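The arithmetic behind that note is simple to check. The reward value and step count below are made-up numbers for illustration:

```python
# Sketch: why a manually played episode can return less than V(start).
# With discount gamma = 0.9, a reward of +1 collected after 5 steps is
# worth gamma**5 at the start state, not the full +1.
gamma = 0.9
reward, steps = 1.0, 5
discounted = reward * gamma ** steps
print(round(discounted, 4))   # 0.5905
```

So the longer your manual route to the exit, the smaller the discounted return, even if you collect the same rewards.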
Note that your value iteration agent does not actually learn from experience. Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment. When it does interact with the environment, it simply follows the precomputed policy (i.e., it becomes a reflex agent). This distinction may be subtle in a simulated environment like a Gridworld, but it's very important in the real world, where the real MDP is not available.
A stub of a Q-learner is specified in
qlearningAgents.py, and you can select it with the
-a q option. For this question, you must implement the update,
getValue(), getQValue(), and getPolicy() methods. Note that for
getPolicy(), you should break ties randomly for better behavior. The
random.choice() Python function will help.
In a particular state, actions your agent hasn't seen before still have a Q-value, specifically a Q-value of zero; if all of the actions the agent has tried so far have negative Q-values, an unseen action may therefore be optimal. With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:
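The two pieces described above can be sketched together: the Q-learning update and greedy action selection with random tie-breaking. The dict-based Q-table and helper names below are illustrative, not the project's exact API.

```python
# Minimal sketch of a Q-learning update and tie-broken greedy policy.
import random
from collections import defaultdict

Q = defaultdict(float)     # unseen (state, action) pairs default to 0.0
alpha, gamma = 0.5, 0.9    # illustrative learning rate and discount

def update(s, a, r, s2, legal_next):
    # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s2,a'))
    sample = r + gamma * max((Q[(s2, a2)] for a2 in legal_next), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

def best_action(s, legal):
    # Break ties among maximal Q-values randomly, via random.choice().
    best = max(Q[(s, a)] for a in legal)
    return random.choice([a for a in legal if Q[(s, a)] == best])

update('s0', 'east', 1.0, 's1', ['exit'])
print(Q[('s0', 'east')])   # 0.5  (= 0.5*0 + 0.5*(1.0 + 0.9*0))
```

Because `defaultdict(float)` returns 0.0 for unseen pairs, an untried action automatically beats any action whose learned Q-value is negative, exactly as described above.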
python gridworld.py -a q -k 5 -m
-k will control the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake."
Complete your Q-learning agent by implementing epsilon-greedy action selection in
getAction(), meaning it chooses random actions some small epsilon fraction of the time, and follows its current best Q-values otherwise.
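Epsilon-greedy selection can be sketched in a few lines. This is a standalone illustration under assumed inputs; the project's getAction() would instead query the agent's own legal-action and best-action helpers.

```python
# Sketch of epsilon-greedy action selection: explore with probability
# epsilon, otherwise follow the current best Q-values.
import random

def epsilon_greedy(legal_actions, greedy_action, epsilon=0.05):
    if random.random() < epsilon:
        return random.choice(legal_actions)   # explore uniformly
    return greedy_action                      # exploit

picks = [epsilon_greedy(['north', 'south'], 'north', epsilon=0.0)
         for _ in range(10)]
print(picks.count('north'))   # 10: with epsilon=0 the choice is always greedy
```

With epsilon=1.0 the agent explores every step; small values like 0.05 keep occasional exploration while mostly exploiting what has been learned.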
python gridworld.py -a q -k 100
Which approach (between Value Iteration and Q-Learning) turned out to be faster? Submit your answer here.
Students enrolled in the full course experience: Discuss why you think we observe this difference with your coach.
(Optional) Try a non-zero living reward (
-r), e.g. a negative value or a high positive one. How does this affect the policy learned?
Time to play some Pacman! Our Pacman will play the game in two phases. In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, it will enter testing mode. When testing, Pacman's
self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit the learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run Q-learning Pacman for very small grids as follows:
python pacman.py -p PacmanQAgent -x 2000 -n 2010 -l smallGrid
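It is worth seeing why setting self.alpha to 0.0 freezes learning. Assuming the standard Q-learning update form (the numbers below are arbitrary), the new sample is ignored entirely:

```python
# Sketch: with alpha = 0, the update Q <- (1-alpha)*Q + alpha*sample
# reduces to Q <- Q, so Q-values no longer change during test games.
alpha = 0.0
q_old, sample = 3.7, -12.0          # arbitrary illustrative numbers
q_new = (1 - alpha) * q_old + alpha * sample
print(q_new == q_old)               # True: the sample has no effect
```

Combined with disabled exploration, this makes test-mode Pacman a pure exploiter of the policy learned during training.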
PacmanQAgent is already defined for you in terms of the
QLearningAgent you've already written.
PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pacman problem (
epsilon = 0.05,
alpha = 0.2,
gamma = 0.8).
Here we train the agent for 2000 episodes (
-x 2000) and test it for 10 episodes, for a total of 2010 games (
-n 2010). Your solution will receive full credit for this question if the command above runs without exceptions and your agent wins at least 80% of the test games. Refine
QLearningAgent until you achieve this level of performance.
(Optional) Run your agent on some larger grids and observe the change in performance, e.g.
-l mediumGrid (you can find more grids in the layouts directory). How often does the agent win? Try varying the number of training episodes until you get results similar to those on the small grid. Comment on the reason behind this difference in performance between smaller and larger grids.