ml » project »

Final Project for Reinforced Learning

Note: The following instructions are for Udacity students who are enrolled in the Machine Learning 3: Reinforcement Learning course. They do not apply to Georgia Tech OMSCS students.


The following project has three parts. They will help you to design a reinforcement learning agent that can successfully play Pacman. We have provided you with a code base in which most of the base Gridworld and Pacman GUI handlers are already designed. You have to do the following:

In Part 1, you will implement value iteration and test it on the Gridworld domain.
In Part 2, you will implement Q-Learning and test it on the Gridworld domain.
In Part 3, you will implement options and constraints for the Pacman domain.


Parts of this project will be automatically verified through the Udacity autograder, but the majority of the project will be self-assessed using this rubric.

Reference Material

Reinforcement Learning: An Introduction (Sutton and Barto).

Code Base

You can download the codebase here.


Part 1: Value Iteration

The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI).  If you run an episode manually, your total return may be less than you expected, due to the discount rate (-d to change; 0.9 by default).

Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q).

As in Pacman, positions are represented by (x,y) Cartesian coordinates and any arrays are indexed by[x][y], with 'north' being the direction of increasing y, etc.  By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r).


  1. The Gridworld domain can be accessed using the These are the modes you can use:

    a) python -h (for a full list of options)

    b) python -m (for moving the agent in the gridworld manually)

    When you move the agent manually, you will see that the agent follows your actions only 80% of the time. Such is the life of the Gridworld agent!

  2. Write a value iteration agent in ValueIterationAgent, which has been partially specified for you in  Your value iteration agent is an offline planner, not a reinforcement agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i) in its initial planning phase. 

    ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.

    Value iteration computes k-step estimates of the optimal values, Vk. In addition to running value iteration, implement the following methods for ValueIterationAgent using Vk.

    a) getValue(state) returns the value of a state.

    b) getPolicy(state) returns the best action according to computed values.

    c) getQValue(state, action) returns the q-value of the (state, action) pair.

  3. The following command loads your ValueIterationAgent, which will compute a policy and execute it 10 times. Press a key to cycle through values, q-values, and the simulation. You should find that the value of the start state (V(start)) and the empirical resulting average reward are quite close.

    python -a value -i 100 -k 10

    You can test your state values here.

Part 2: Q-Learning

Note that your value iteration agent does not actually learn from experience.  Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment.  When it does interact with the environment, it simply follows the precomputed policy (e.g. it becomes a reflex agent). This distinction may be subtle in a simulated environment like a Gridword, but it's very important in the real world, where the real MDP is not available.


  1. A stub of a q-learner is specified in QLearningAgent in, and you can select it with the option '-a q'. For this question, you must implement the update, getValue, getQValue, and getPolicy methods. Note that for getValue and getPolicy, you should break ties randomly for better behavior. The random.choice() function will help. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal. With the q-learning update in place, you can watch your q-learner learn under manual control, using the keyboard

    python -a q -k 5 -m

    Recall that -k will control the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake."

  2. Complete your q-learning agent by implementing epsilon-greedy action selection in getAction, meaning it chooses random actions epsilon of the time, and follows its current best q-values otherwise.

    python -a q -k 100

Part 3: Pacman and Options/Constraints

It's time to play some Pacman! Pacman will play games in two phases. In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, he will enter testing mode. When testing, Pacman's self.epsilon and self.alpha will be set to 0.0, effectively stopping q-learning and disabling exploration, in order to allow Pacman to exploit his learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run q-learning Pacman for very tiny grids as follows:

python -p PacmanQAgent -x 2000 -n 2010 -l smallGrid

Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you've already written. PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pacman problem (epsilon=0.05, alpha=0.2, gamma=0.8). You will receive full credit for this question if the command above works without exceptions and your agent wins at least 80% of the last 10 runs.