Final Project: Design Learning Agents for Gridworld and Pacman

Project Overview

This project has three parts. The first two use a familiar Gridworld domain to train a Reinforcement Learning agent. The third part involves designing an agent that can successfully play Pacman! Please download the code template (ml3_project.zip, under the Downloadables section here), in which most of the base Gridworld and Pacman GUI handling is already implemented. You have to do the following:

  • In Part 1, you will implement Value Iteration and test it on the Gridworld domain. 
  • In Part 2, you will implement Q-Learning and test it on the Gridworld domain. 
  • In Part 3, you will implement options and constraints for the Pacman domain.


This project will be automatically verified through the Udacity autograder.


Part 1: Value Iteration

The Gridworld is designed such that you must first enter a pre-terminal state (the double-walled cells shown in the GUI) and then take the special 'exit' action before the episode actually ends (in a virtual terminal state called TERMINAL_STATE, which is not shown in the GUI).


Figure 1. The Gridworld GUI


  1. The Gridworld environment is created and managed using gridworld.py. Try running it in these two modes:

    a) python gridworld.py -h (for a full list of options)

    b) python gridworld.py -m (for moving the agent in the gridworld manually using arrow keys)

    When you move the agent manually, you will see that the agent follows your actions only 80% of the time. Such is the life of the Gridworld agent!

    Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q).

    As in Pacman, positions are represented by (x, y) Cartesian coordinates and any arrays are indexed by [x][y], with 'North' being the direction of increasing y, etc.  By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r).

  2. Write a value iteration agent in the class ValueIterationAgent, which has been partially specified for you in valueIterationAgents.py.  Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i) in its initial planning phase. ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.

    Value iteration computes k-step estimates of the optimal values, V_k. In addition to running value iteration, implement the following methods for ValueIterationAgent using V_k (a minimal sketch of these methods appears after this list):

    a) getValue(state) returns the value of a state.

    b) getPolicy(state) returns the best action according to the computed values.

    c) getQValue(state, action) returns the Q-value of the (state, action) pair.

  3. The following command loads your ValueIterationAgent (selected using the -a value command-line option), runs value iteration for 100 iterations (-i 100), and then executes the resulting policy for 10 episodes (-k 10). Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (V(start)) and the empirical average return are quite close:

    python gridworld.py -a value -i 100 -k 10

    You can submit and verify your state values here. NOTE: If you run an episode manually (-m), your total return may be less than expected, due to the discount rate (-d flag; 0.9 by default): for example, with the default discount of 0.9, a reward of +1 collected after five steps contributes only about 0.9^5 ≈ 0.59 to the return.
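
To make the expected structure concrete, here is a minimal sketch of the value iteration loop and the three methods. It assumes the MDP interface exposed by the provided mdp.py (getStates(), getPossibleActions(), getTransitionStatesAndProbs(), getReward(), isTerminal()); the class name below is illustrative, and your actual ValueIterationAgent in valueIterationAgents.py will differ in its details.

    import collections

    class ValueIterationSketch:
        """A minimal sketch of batch value iteration (not the graded class)."""

        def __init__(self, mdp, discount=0.9, iterations=100):
            self.mdp = mdp
            self.discount = discount
            self.values = collections.defaultdict(float)  # V_k, defaults to 0
            for _ in range(iterations):
                # Batch update: compute V_{k+1} entirely from a frozen copy of V_k.
                new_values = collections.defaultdict(float)
                for state in mdp.getStates():
                    if mdp.isTerminal(state):
                        continue
                    actions = mdp.getPossibleActions(state)
                    if actions:
                        new_values[state] = max(self.getQValue(state, a) for a in actions)
                self.values = new_values

        def getValue(self, state):
            return self.values[state]

        def getQValue(self, state, action):
            # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + discount * V(s')]
            return sum(prob * (self.mdp.getReward(state, action, next_state)
                               + self.discount * self.values[next_state])
                       for next_state, prob in self.mdp.getTransitionStatesAndProbs(state, action))

        def getPolicy(self, state):
            actions = self.mdp.getPossibleActions(state)
            if not actions:
                return None
            return max(actions, key=lambda a: self.getQValue(state, a))

Note the batch update: each V_{k+1} is computed from the previous V_k, not from values already updated in the same sweep.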

Part 2: Q-Learning

Note that your value iteration agent does not actually learn from experience.  Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment.  When it does interact with the environment, it simply follows the precomputed policy (in effect, it becomes a reflex agent). This distinction may be subtle in a simulated environment like a Gridworld, but it's very important in the real world, where the real MDP is not available.


  1. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py, and you can select it with the -a q option. For this question, you must implement the update(), getValue(), getQValue(), and getPolicy() methods. Note that for getPolicy(), you should break ties randomly for better behavior; the random.choice() Python function will help.

    In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal. With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:

    python gridworld.py -a q -k 5 -m

    Recall that -k will control the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake."

  2. Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction(): it should choose a random action a small epsilon fraction of the time, and follow its current best Q-values otherwise (a sketch combining the update rule and epsilon-greedy selection appears after this list). Then let your agent learn on its own for 100 episodes:

    python gridworld.py -a q -k 100

    Which approach (between Value Iteration and Q-Learning) turned out to be faster? Submit your answer here.

    Students enrolled in the full course experience: Discuss why you think we observe this difference with your coach.

    (Optional) Try a non-zero living reward (-r), e.g. a negative value or a high positive one. How does this affect the policy learned?
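
For reference, here is a minimal sketch of the pieces described above, combining the sample-based Q-value update with epsilon-greedy selection. The class and attribute names are illustrative; in particular, the real stub obtains legal actions through its own helper rather than taking them as parameters (done here only to keep the sketch self-contained), so adapt accordingly.

    import random

    class QLearningSketch:
        """A minimal sketch of tabular Q-learning with epsilon-greedy exploration."""

        def __init__(self, alpha=0.5, epsilon=0.1, discount=0.9):
            self.alpha = alpha        # learning rate
            self.epsilon = epsilon    # exploration probability
            self.discount = discount  # gamma
            self.qvalues = {}         # (state, action) -> Q-value, unseen pairs default to 0

        def getQValue(self, state, action):
            return self.qvalues.get((state, action), 0.0)

        def getValue(self, state, legal_actions):
            # V(s) = max over a of Q(s, a); 0 for terminal states with no legal actions.
            if not legal_actions:
                return 0.0
            return max(self.getQValue(state, a) for a in legal_actions)

        def getPolicy(self, state, legal_actions):
            if not legal_actions:
                return None
            best = self.getValue(state, legal_actions)
            # Break ties randomly among the best actions, as suggested above.
            best_actions = [a for a in legal_actions if self.getQValue(state, a) == best]
            return random.choice(best_actions)

        def getAction(self, state, legal_actions):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if not legal_actions:
                return None
            if random.random() < self.epsilon:
                return random.choice(legal_actions)
            return self.getPolicy(state, legal_actions)

        def update(self, state, action, next_state, reward, next_legal_actions):
            # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
            sample = reward + self.discount * self.getValue(next_state, next_legal_actions)
            old = self.getQValue(state, action)
            self.qvalues[(state, action)] = (1 - self.alpha) * old + self.alpha * sample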

Part 3: Pacman and Options/Constraints

Time to play some Pacman! Our Pacman will play the game in two phases. In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, it will enter testing mode. When testing, Pacman's self.epsilon and self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit the learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run Q-learning Pacman for very small grids as follows:

python pacman.py -p PacmanQAgent -x 2000 -n 2010 -l smallGrid

Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you wrote in Part 2; it differs only in its default learning parameters, which are more effective for the Pacman problem (epsilon = 0.05, alpha = 0.2, gamma = 0.8).
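
As a rough illustration of that relationship, PacmanQAgent can be thought of as a thin subclass that simply bakes in those defaults. The sketch below is hypothetical: it assumes the QLearningAgent constructor accepts these keyword arguments, which may not match the exact signature in qlearningAgents.py.

    from qlearningAgents import QLearningAgent  # the class you completed in Part 2

    class PacmanQAgentSketch(QLearningAgent):
        """Hypothetical subclass showing how Pacman-friendly defaults could be supplied."""

        def __init__(self, epsilon=0.05, gamma=0.8, alpha=0.2, **kwargs):
            # Pass the Pacman-tuned learning parameters on to the generic Q-learner.
            kwargs['epsilon'] = epsilon
            kwargs['gamma'] = gamma
            kwargs['alpha'] = alpha
            QLearningAgent.__init__(self, **kwargs)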

Here we train the agent for 2000 games (-x 2000) and then test it on the remaining 10 games, for a total of 2010 games (-n 2010). Your solution will receive full credit for this question if the command above runs without exceptions and your agent wins at least 80% of the test games. Refine QLearningAgent until you achieve this level of performance.

(Optional) Run your agent on some larger grids and observe the change in performance, e.g. -l mediumGrid (you can find more grids in the layouts directory). How often does the agent win? Try varying the number of training games until you get results similar to those on the small grid. Comment on the reason for this difference in performance between the smaller and larger grids.