It is going quite well. Thank you very
much for asking. How are things going with you?
I can't complain because I've been barred by a judge.
[LAUGH] I'm pretty sure you can still complain. Complaining is a
fine art that requires lots of practice. Guess what we're going to do today?
I'm reading decision making and
reinforcement learning which is very exciting.
Except the hyphen.
Except for the hyphen?
Yeah, the hyphen's wrong.
It's not decision-dash-making.
You're right. I normally don't do that
but people always want to put a dash
so what I'm going to do, Michael, just for you, is remove the dash.
And I'm going to put it over here where it belongs.
[LAUGH] No. There. Well, Michael, this is
the first of our discussions on the last
mini course of machine learning, Reinforcement Learning, which
I believe you know a little bit about.
I am very interested in Reinforcement Learning.
Excellent. So I'm just going to do
a little bit of background. It's going to
be relatively short and straightforward. We're
going to do a couple of quizzes, because
I know how much you like quizzes. And then we're just going to go back
and forth and see what we get to learn about Reinforcement Learning. Sound good?
Excellent, okay so, let's get started.
Decision Making and Reinforcement Learning
Okay Michael so before we get started diving into
a particular formalism or framework that I want to
talk about, that we are going to use for
this mini course or at least the first half or
so of it. I want to remind everybody what
the differences are between the three types of learning
that we said we are going to look at
over this entire course. Do you recall what they are?
Well, reading off the slides: supervised, unsupervised, and reinforcement.
That's right. And you know, again these things are all strongly
related to one another, but it's very useful to think about them as being very
separate. So just as a reminder, supervised learning
sort of takes the form of function approximation
where you're given a bunch of x, y pairs, and your goal is to find
a function f that will map some new x to a proper y, you recall that?
Unsupervised learning is very similar to supervised learning except that
it turns out that you're given a bunch of x's
and your goal is to find some f that gives you a compact description of the set of x's that you've seen. So we call this clustering, or description, as
opposed to function approximation, and finally we're getting to reinforcement
learning. Now reinforcement learning is actually a name that
means many things in different fields, and here we tend
to talk about it in a relatively specific way, and
superficially it looks a lot like supervised learning, in
that we're going to be given a string of pairs
of data, and we're going to try to learn some
functions. But in the function approximation case, the supervised learning case, we were given a bunch of x and y pairs and asked to learn f. But in reinforcement learning, we're given something totally different. We're instead going to be given x's and z's, and
I'll tell you what the x's and the z's stand for in
a minute, and then we're going to have to learn some f that is going to generate y's. So even though it's going to look a lot like function approximation, it's going to turn out to have a very different character. And that's really what the next few slides and a little bit of discussion are about: understanding what that character is. You'll also notice from the title here that I have decision making and reinforcement
learning, and that's because reinforcement learning is
one mechanism for doing decision making. And again, I'll define
that in just a second. Okay, so you with me, Michael?
I think so. So should y be circled?
because in some sense, you underlined the things we
were given and you circled the things we needed to
find, so y is something that we're going to find, right?
Yeah, I suppose that I like that.
All right. Interesting.
Now it's a theta.
[LAUGH] It's a Y trapped inside of a theta and it's yelling, Y?
[LAUGH] I like that. I'm a Y
trapped in a theta. Hm, I need to write a book
about that. Okay, good. That's a good point, Michael. So before, we were effectively trying to learn one thing, and here we're still learning one thing, because it's going to produce the other thing for us, deterministically, usually. But it's worth pointing out that we are going to be figuring out how to produce both of these things, as opposed to being given those things. Good job. Okay, are you ready to move forward, Michael, with an example and a quiz?
Michael, here is a world. In fact, for the
purposes of this discussion it is the world. Okay. So
imagine the entire universe is well described by this
picture over here. Okay? Now this is called a grid
world, which is something that people in reinforcement learning
love to think about, because it's a nice approximation for
all the complexity of the entire universe. Now this particular world is a three by four grid. So you have one comma one, two comma one, three comma one, four comma one. You have one comma two, one comma three, and so on and so forth. For the purposes of this discussion, we can think of the world as being a kind of game where you start out in a state, which we're going to call the start state, and you're able to execute actions, one of these four: up, down, left, and right. And the purpose is to wander around this world in such a way that eventually you make it to the goal over here, this little green spot. You see that, Michael?
And under all circumstances, you must avoid the red spot.
Exactly. Now for this particular example, up
does what you think it does, down does
what you think it does, as do left and right. But if you find yourself at a
boundary, such as right up here in the
upper-left-hand corner, and you try to go up, you
just stay where you are. If you try to go left, you just stay where you are.
But if you go right, you do actually end up in the next square. Got it?
Okay. Three last things. One, this little black space here is
a place you can't enter into, so it acts just like a
wall. This green space is the goal, and once you're there it's
over. The world is over and you get to start over again.
And once you enter into the red spot, the world is also over and you
have to start over again. So you can't go through the red square to get to
the green square. Okay, you got it?
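To make the physics concrete, here is a minimal sketch of this grid world in Python; the coordinate convention (x runs 1 to 4 left to right, y runs 1 to 3 bottom to top, so the goal sits at (4, 3) and the red square at (4, 2)) is an assumption for illustration.

```python
# A sketch of the 3x4 grid world, with assumed (x, y) coordinates:
# x runs 1..4 left to right, y runs 1..3 bottom to top.
START = (1, 1)   # bottom-left start state
GOAL = (4, 3)    # green square: +1, game over
PIT = (4, 2)     # red square: -1, game over
WALL = (2, 2)    # black square: cannot be entered

ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def move(state, action):
    """Deterministic physics: bump into the wall or an edge and stay put."""
    x, y = state
    dx, dy = ACTIONS[action]
    nxt = (x + dx, y + dy)
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt
```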
Excellent. So, here's the quiz. Given this
particular world, with the physics I just described to
you, and given these actions that you can take,
up, down, left and right, what is the shortest
sequence of actions that would get us from
the start state to the goal state? And you
can just type in the words up, down, left
and right, separated by spaces, or commas or anything
like that. Okay. [LAUGH] Let's say semicolons are allowed, colons are not.
Yeah, good. I am glad you made that distinction.
Well, it's an important one to make. Okay, you got it?
Yeah, I think so.
Alright, cool. So lets see what you get then. Go.
Okay, Michael, you got an answer for me? This isn't meant to be a hard quiz.
Oh good. Cause I actually have two answers for you.
Oh, I like that.
So I would say right, right, up, up, right.
Okay, so, right, right, up, up, right.
But I feel like I could have also said up, up, right, right, right.
And you could have. Both of those are acceptable. Right, right
does what you think it does. You move right and then right and
then up and then up and then right. And
that takes five steps. Or you could have gone
up, up, right, right, right, which would also have
taken you five steps. So that's pretty easy, right?
Yeah, I thought so.
Right. So yes, either of these would be correct. We're going to
stick with this one in particular, since they're equal, because we gotta pick
one of them. So this does point out something: that often in a decision problem like this, there are in fact multiple answers, or multiple
optima, if I can tie it back
to our randomized optimization discussions. Okay, so you
got the quiz, you got the world, you understand what you can do here. Yes?
Okay, cool. Now I'm going to throw in a little wrinkle.
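As a quick sanity check before the wrinkle, both answers can be run through the deterministic move() sketch from above:

```python
for seq in (['right', 'right', 'up', 'up', 'right'],
            ['up', 'up', 'right', 'right', 'right']):
    s = START
    for a in seq:
        s = move(s, a)
    print(s == GOAL)   # True for both five-step sequences
```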
The World II
I'm going to change the world just a little bit Michael. Okay,
that last one was really easy, and it was easy for
a bunch of reasons, but one of the reasons it was
easy is that every time it took an action it did
exactly what you expected it to do. Now I want to introduce
a little bit of uncertainty, or stochasticity into the world. So,
let me tell you exactly what that means. When you execute
an action, it executes correctly with probability 0.8. So 80% of the time, if you say up, it goes up, assuming you can go up. If you say down, it goes down, assuming you can. Left, it goes left; right, it goes right. Got it? Yeah!
Now, 20% of the time, the action you take actually causes you to
move at a right angle. Now, of course, there are two different right angles you
could go to, if you go up, you could go either left or right at
a right angle. And so that 20%
gets distributed uniformly. Okay? Does that make sense?
Yeah, I think so. And so if you're in the start state and you try to go up, then there's a 10% chance that you end up bumping into the wall.
And then you just stay where you are I guess.
Right. So if you decide to go up from here, you have an 80% chance of moving up, a 10% chance of moving to the right, and a 10% chance of moving to the left, but, of course, you'd bump into the wall and end up right back where you started. Got it?
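Those noisy dynamics can be sketched in code, building on move() from above; the 80/10/10 split is exactly the one just described:

```python
import random

# Going up can slip left or right; going left can slip up or down, and so on.
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def noisy_move(state, action):
    """Execute the intended action with probability 0.8; otherwise
    slip at a right angle, 0.1 each way. Uses move() from above."""
    r = random.random()
    if r < 0.8:
        actual = action
    elif r < 0.9:
        actual = PERPENDICULAR[action][0]
    else:
        actual = PERPENDICULAR[action][1]
    return move(state, actual)
```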
So, here is my quiz question for you Michael. You recall, we came up
with two sequences and the one that I decided to keep was up, up, right, right,
right. My question to you is: what is the reliability of the sequence up, up, right, right, right actually getting you from the start to the goal, given these probabilities, this uncertainty?
So we're just going to try that sequence, and ask whether or not it actually got us to the goal.
Exactly. And when I say, reliability, I really
mean, what's the probability of it actually succeeding? Interesting. Okay.
Alright. So let's see if we can figure out the answer. You're ready?
The World II
Okay Michael do you have an answer?
I have an answer indeed.
Okay, what's the answer?
Okay, maybe I don't have an answer indeed.
But I have I have the ability to compute one.
I'm, I'm doing some math.
So why don't you walk me through the math you're doing.
I got .32776.
That is correct Michael.
Woo-hoo. That was trickier than I thought.
Okay well, show me what you did.
Alright, so the
first thing I did is I said, okay, well this
is not so hard, because, from the start state, if I execute up, up, right, right, right, each of those things works the way it's supposed to, independently, with probability .8.
Yep. And so, .8 raised to the 5th power, gives
me the probability that that entire sequence will work as intended.
Exactly. And what is .8 to the 5th? Do you know?
32768, with a decimal point in front of it, because, you know, powers of 2.
Wow. That is correct. But I notice that 32768 is not 32776.
No, they differ by a little smidge.
So what occurred to me next is: is there any way that you could have ended up falling into the goal from that sequence of commands without following the intended path? So, since actions can have unintended consequences, I was going to ask, okay, so if I go up in the first step, there's a probability .1 that I'll actually go to the right.
Yep. From there, if I go up, there's
a 10% probability that I'll actually go to the right.
From there, the next thing I do is take the
right action, but that can actually go up with some probability, .1.
And then another .1 for the next right action to actually cause an up to happen.
Mm-hm. And then finally, that last right might
actually execute correctly and bring me into the goal. So
I could go underneath the barrier, instead of around the
barrier with that same sequence. It just isn't very likely.
Okay. Well how unlikely is it?
Alright. So I did .1 times .1 times .1 times .1 times .8. Huh, so that's .1 to the 4th times .8. That's right: this is the probability that four of these go wrong, in fact exactly the first four go wrong, and the last one goes right. Right. And that's equal to some very, very small number; in fact, it's equal to 0.00008. And when you add it up, you end up with .32776. And that's how
you get that number and that's correct. And you'll
notice that this sequence happens to work out to be the second sequence that we had as an option.
Yeah, the other thing that took us to the goal, right.
It was exactly the probability of executing that sequence of transitions given the first command.
Yeah, and I think it would work the other way, too. No, it wouldn't quite work the other way. If you had done the sequence, what is it? Right, right, up, up, right? In order for that one to work out, the rights would have to send you up and then the ups would have to send you right, and then, yeah, so it would actually work out to be the same.
So no matter which of the two sequences you came
up with, they would have the same probability of succeeding. Neat.
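For the record, the arithmetic in this exchange is easy to check in a few lines:

```python
p_intended = 0.8 ** 5       # all five actions execute as intended
p_slip = 0.1 ** 4 * 0.8     # the first four slip, the last one executes
print(round(p_intended, 5))           # 0.32768
print(round(p_slip, 5))               # 8e-05, i.e. 0.00008
print(round(p_intended + p_slip, 5))  # 0.32776
```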
Nice. That's actually kind of cool. So, good
job, Michael. Very good job on the quiz. A
lot of people forget this part. And, in fact, if
you forgot that part and got just this part right, we actually let you pass. But it was wrong.
I was kind of expecting you to get
it wrong, but I'm glad you got it right too.
Thank you Michael, I appreciate your faith in me.
I wrote the quiz. Actually, I stole this from a book,
Oh.
whose words are now showing up in front of you, exactly where you can go through the details of this quiz. Okay, alright, Michael. So you
might ask yourself why I brought this up, and the reason I brought
this up is because, what we did in the first case where we
just came up with the sequence up, up, right, right, right, is we
sort of planned out what we would do in a world where
nothing could go wrong. But once we introduced this notion of uncertainty, this,
this randomness, this stochasticity, we have to do something other than work out
in advance what the right answer is, and then just go. We have
to do something a little bit more complicated. Either we have
to try to execute these, and then every once in a
while, see if we've drifted away, and then re-plan: come up with a new sequence wherever it is we happen to end up. Or we have to come up with some way to incorporate all of these uncertainties and probabilities, so that we never really have to rethink what to do in case something goes wrong. So, what I'm going to do next is I'm going to introduce
the framework, that's very common, that people use as a
way of capturing this stuff, capturing these uncertainties directly. Okay?
Markov Decision Processes
So this is the framework that we're going to be using
through most of the discussions that we'll be having, at least on reinforcement learning, single-agent reinforcement learning, and it's called
the Markov Decision Process. This should sound familiar to you, Michael.
Well, you did say we're going to talk about decisions.
That's true, and we need a process for making decisions. And we're going
to introduce something called the Markovian property
as a part of this discussion and
I'll tell you exactly what that means in a moment.
So, I'm just going to write out this framework and tell you what it is and what problem it produces for us. And then we're going to start talking about solutions through the rest of the discussion. So a Markov Decision Process tries to capture worlds like this one by dividing them up in the following way. We say
that there are states. And states are a set of tokens
that somehow represent every state, for lack of a better word, that
one could be in. So, does that make sense to you, Michael?
Yeah, I think so.
So what would the states be in the
world that we've been playing around in so far?
So, the only thing that differs from
moment to moment is where, I guess, I am.
Like, which grid position I'm in.
So, I feel like each different grid position I could
be in is a state, maybe there's a state for being successfully
done or unsuccessfully done?
It's possible. But let's stick with the simple one. I
like that one because that's really, I think, easy to grasp. So,
there are at least, of all the states one could reach,
there are, well, let's see, four times three minus one, since you can never reach this state. Although we could say it is a state, we just happen to never reach it. So, at most,
if we just think of this grid literally as a grid there
are something like twelve different states. And we can represent these states
as their x, y coordinates, say. We could call this, the start state, say, 1,1, which is sort of how I described it earlier. We could describe the goal state as 4,3.
And say this is how we describe our
states. Or frankly, it doesn't matter. We could
call these states 1,2,3 up to 12. Or we could name them Fred and Marcus. It
doesn't really matter. The point is
that they're states, they represent something, and
we have some way of knowing which state we happen to be in. Okay?
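Under the same assumed coordinates as the earlier sketch, the state set can be written down explicitly:

```python
# All grid positions except the wall; 4*3 - 1 = 11 reachable states.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
print(len(STATES))   # 11
# The names are arbitrary: integers, coordinates, or 'Fred' would all do,
# as long as we have some way of knowing which state we're in.
```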
Markov Decision Processes Two
Alright, so we've got states. So next is what's called the model, or the transition model. Sometimes people refer to it as the transition function, and basically it is a function of three variables: a state, an action, which I haven't defined for you yet, and another state. In fact, since I haven't defined what an action is for you yet, let's skip that and actually define actions for you. So the third part of our framework is actions.
Actions are the things that you can do in a
particular state. So what would the actions be in this world?
Well, the four different decisions I could make were up, down, left, and right. Though it's maybe a little confusing because
I think those were also the four possible outcomes. That's true.
Well no, there are other outcomes. You could stay where you were.
So these are actually the actions. These are the things that
when I'm in a given state I'm allowed to execute. I can
either go up, go down, go left or go right. You'll notice
that as I described before there was no option not to move. Mm.
But there could have been, and there could
have been other actions, like teleport, or anything that you can imagine. But the point is that
your action set represents all of the things that the
agent, the robot, the person, or whatever it is
you're trying to model, is allowed to do. Now, in
its full, generalized form, we can think of the set of actions that one can take as being a function of state, because there are some states where you can do some things and some states where you can't do those things. But most of the time people just treat it as a set of actions, and the actions that aren't allowable in particular states have no effect.
Alright, so we understand states. They're the things that describe the world.
We understand actions. Those are the things that you can do, the
commands you can execute when you're in particular states. And what a
model describes, what the transition model describes is in some sense the
rules of the game that you're playing. It's
the physics of the world. So it's a function
of three variables: a state, an action and
another state, which, by the way, could actually be
the same state. And what this function produces
is the probability that you will end up transitioning to state s prime, given that you were in state s and you took action a. Got it?
I think so. So the s prime is where you end up, and the s, a is what you're given. So you plug these three in.
Oh, I see, and you get a probability. Mm-hm. But the probabilities have to add up to one if you sum them up over all the s primes.
Right. That's exactly right. So, for example, if you think about the deterministic case where there was no noise, then this is a very simple model. If I'm in the start state and I take the action up, then what's the probability I end up in the state immediately above it?
Was that 0.8?
No, when we first started, in the deterministic world.
Oh, that was probability one.
Right, and what would be the probability, you
ended up in the state to the right? Probability zero.
Right. In fact, the probability that you end up at any of the
other states is zero in the deterministic world. Now what about the case when
we were in the non-deterministic world where an action
would actually execute faithfully only 80% of the time? If I'm in the start state and I go up, what's the probability that I end up in the state above?
That was 0.8
What's the probability I end up in the state to the right?
That was 0.1.
And what's the probability I end up where I started?
That was also 0.1.
Right, and 0 everywhere else. And that's just sort of the way it works. So the model really is an important thing, and the reason it's important is that it really does describe the rules of the game. It tells you what will happen if you do something in a particular place. It captures everything that you know about the transitions of the world, which is what you know about the rules. Got it?
You called it physics before?
I called it the physics of the world.
These are the rules that don't change.
But they're very different from real world physics.
Well, yeah, although they don't have to be. I mean, in some sense, you could argue that a Markov decision process, what we've described so far, these three things, in fact do describe the universe. The states are, you know, the positions and velocities of all the atoms in the universe. The transition model tells you, if you take certain actions in the world, whatever they are, how the state of the universe changes in response to that. And the actions are whatever that set of actions could be. And it can be probabilistic or not probabilistic.
It's definitely probabilistic. The transition
models are by their very definition probabilistic.
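For this grid world, the transition model can be written out exactly; this sketch reuses move() and PERPENDICULAR from the earlier sketches, and its outputs match the numbers from the dialogue:

```python
def T(s, a, s_prime):
    """T(s, a, s'): probability of landing in s_prime after taking
    action a in state s, under the 80/10/10 dynamics above."""
    prob = 0.0
    for actual, p in ((a, 0.8),
                      (PERPENDICULAR[a][0], 0.1),
                      (PERPENDICULAR[a][1], 0.1)):
        if move(s, actual) == s_prime:
            prob += p   # bumping a wall folds probability back onto s itself
    return prob

# The numbers from the dialogue, for 'up' from the start state:
print(T((1, 1), 'up', (1, 2)))  # 0.8
print(T((1, 1), 'up', (2, 1)))  # 0.1
print(T((1, 1), 'up', (1, 1)))  # 0.1 (slipped left into the wall and stayed)
```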
Markov Decision Processes Three
Now, I actually snuck something important in here; actually, I snuck two things that are important in here. The first is what's called the Markovian property. Do you remember what the Markovian property is, Michael?
From, I dunno. Actually, where does the Markovian property come from?
I'm going to say Russia.
Okay, yeah, from the, from the Russian.
Yeah, so I have Russian ancestors, and they passed on to me this idea that
Markov means that you don't have to
condition on anything past the most recent state.
That's exactly right.
The Markovian property is that only the present matters.
And they had to pass that down to me you
know, one generation at a time, because you know, Markov property.
Exactly right; that was very good, Michael. So what does this mean? What this means is our transition function, which shows you the probability you end up in some state s prime, given that you're in state s and took action a, only depends upon the current state s. If it also depended upon where you were 20 minutes before, then you would have to have more s's here. And then you would be violating the Markovian property.
So is this like, do historians hate this?
Well, you know, one never learns anything from history.
No, you're supposed to learn from history, or you're doomed to,
I don't know, let me make something up, repeat it.
[LAUGH] Fair enough. Historians probably don't like
this, but there is a way for mathematicians to
convince them that they're okay with it. And
the way that mathematicians convince you that you're okay with this is to point out that you can turn almost anything into a Markovian process
by simply making certain that your current state
remembers everything you need to remember from the past.
So, in general, even if something isn't really Markovian, where you need to know not only what you're doing now but what you were doing five minutes ago, you could just turn your current state into what you're doing now plus what you were doing five minutes ago. The obvious problem with that, of course, is that if you have to remember everything from the beginning of time, you're only going to see every state once, and it's going to be very difficult
to learn anything. But that Markovian property turns out to be important, and it actually allows us to solve these problems in a tractable way.
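To picture that trick, here is a toy sketch (an illustration under assumed dynamics, not anything from the lecture): if the dynamics depended on the previous position as well as the current one, folding one step of history into the state restores the Markov property.

```python
def augmented_step(aug_state, action):
    """Restore the Markov property by carrying one step of history
    inside the state itself: state = (previous position, current position)."""
    _prev, curr = aug_state
    nxt = noisy_move(curr, action)   # noisy_move() from the earlier sketch
    return (curr, nxt)               # the new state still remembers the past step
```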
But I snuck in something else, Michael. What I snuck in, with this idea about the transition model, is that nothing ever changes. So this second property
that matters for Markov decision processes, at least for the
sets of things that we're going to be talking about in
the beginning, is that things are stationary. For the purposes of this discussion, that means these rules don't change over time. That's one notion of stationary, okay?
Does that mean that the agent can never leave
the start state?
No, the agent can leave the start state any time it takes an action; then it gets to another state.
Then how is it, how is it stationary, then?
The agent's not stationary; the world is stationary. The transition model is stationary, the physics are stationary, the rules don't change.
The rules don't change.
Okay. I see.
Then there's another notion of stationary
that we'll see a little bit later.
Okay, last thing to point out about the definition of a Markov decision process is the notion of reward. So, reward is simply a scalar value that you get for being in a state. So, for example, we might say, you know, this green goal is a really good goal, and so if you get there, we're going to give you a dollar. This red state, on the other hand, is a very, very bad state. We don't want you to end up there, and so if you end up there, we're going to take away a dollar from you.
What if I don't have a dollar?
Someone will give you the dollar.
The universe will give you the dollar.
[LAUGH] And then take it away?
Or, the universe will take it away. Even
if you don't have it. You'll have negative dollars.
Okay, so reward is very important here in a couple of ways. Reward is one of the things that, as I'm always harping on, encompasses our domain knowledge. So the reward you get from a state tells you the usefulness of entering into that state. Now, I wrote out three different definitions of R here, because sometimes it's very useful to think about them differently. I've been talking about the reward you get for entering into a state, but there's also a notion of reward that you get for being in a state and taking an action, and there's a reward that you could get for being in a state, taking an action, and then ending up in another state s prime. It turns out these are all mathematically equivalent, but often it's easier to think about one form or the other. For the purposes of the rest of this discussion, though, you can really just focus on that one: the reward, or the value, of entering into a state. And those four things by themselves, along with this Markov property and stationarity, define what's called the Markov Decision Process, or an MDP. Got it?
I'm a little stuck on how those could be mathematically equivalent.
Well, we'll get to that later. Would you like a little bit of intuition?
Well, you can imagine that, just as before, when we were dealing with the notion of making a non-Markovian thing Markovian by putting a little bit of history into your state: you can always fold in the action that you took in a state, or the action that you took to get to a state, as a part of your state.
But that would be a different Markov Decision Process.
It would, but they would work out to have the same solution.
Oh, I see.
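One way to picture those three forms is as function signatures; a sketch only, with values echoing the grid world (the per-step -0.04 is an assumption here that anticipates an example later in the lecture):

```python
def R1(s):                      # reward for entering state s
    return {GOAL: 1.0, PIT: -1.0}.get(s, -0.04)

def R2(s, a):                   # reward for being in s and taking action a
    return R1(s)                # in this world it happens not to depend on a

def R3(s, a, s_prime):          # reward for the whole transition (s, a, s')
    return R1(s)                # ...nor on s_prime
```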
Markov Decision Processes Four
Speaking of solutions, this is the last little bit that you need to know. And that is: this defines a problem, but what we also want to have, whenever we have a problem, is a solution. So the solution to a Markov Decision Process is something called a policy. And what a policy does is, it's a function that takes in a state and returns an action. In other words, for any given state that you're in, it tells you the action that you should take.
Like as a hint?
No, it just tells you: this is the action you take.
Well, I mean, I suppose you don't have to do
it, but the way we think about Markov Decision Processes,
is that this is the action that will be taken.
I see, so it's more of an order.
Yes, it's a command. Okay.
So that's all a policy is. A policy is a solution to a Markov Decision Process. And there is a special policy, which I'm writing here as policy star, or the optimal policy, and that is the policy that maximizes your long-term expected reward. So of all the policies you could take, of all the decisions you might make, this is the policy that optimizes the amount of reward that you're going to receive, or expect to receive, over your lifetime.
So, like, at the end?
Well, yeah, at the end, or at
any given point in time, how much reward you're receiving.
From the Markov Decision Process point of view, there doesn't
have to be an end. Okay. Though in this example,
you don't get anything, and then at the end, you get paid off.
Right, or unpaid off.
If you fall into the red square. So
actually, your question points out something very important here. I
mentioned earlier, when I talked about the three kinds of learning, that supervised learning and reinforcement learning were sort of similar, except that instead of getting x's and y's, we were given x's and z's. And this is exactly what's happening here. Here, what we would like to have, if we wanted to learn a policy, is a bunch of (s, a) pairs as training examples: well, here's the state and the action you should've taken; here's another state and the action you should've taken; and so on and
that maps states to actions. But what we actually see
in the reinforcement learning world, in the Markov Decision Process world,
is we see states, actions, and then the rewards that we
received. And so in fact, this problem of seeing a sequence
of states, actions, and rewards is very different from the problem of being told: this is the correct action to take, now go find a function that maps from state to action. Instead, we say, well, if you're in this state and you take this action, this is the reward that you would see. And then from that, we need to find the optimal action.
So pi star is the f from that previous slide? And r is the z?
Yes. And the y is the a.
And the s is the x, or the x is the s.
So, but, okay, I'm a little confused about this notion of a policy. So we have the thing we tried to do to get to the goal, which was up, up, right, right, right. Yes.
I don't see how to capture that as a policy.
It's actually fairly straightforward. What a policy would say is: what state are you in? Tell me what action you should take. So the policy basically is this: when you're in the start state, the action you should take is up. And it would have a mapping for every state that you might see, whether it's this state, this state, this state, this state, this state, this state, this state, or even these two states, and it will tell you what action you should take. And that's what a policy is. A policy, very simply, is nothing more than a function that tells you what action to take in any state you happen to come across.
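Written as a lookup table, that mapping might look like the sketch below; the entries shown just spell out the up, up, right, right, right route state by state (a partial illustration, not necessarily the optimal policy):

```python
policy = {
    (1, 1): 'up', (1, 2): 'up',                        # up the left column
    (1, 3): 'right', (2, 3): 'right', (3, 3): 'right'  # then across the top
    # ...a full policy would have an entry for every state
}

def act(state):
    return policy[state]   # in any state, the policy says what to do
```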
Okay, but the question that you asked before was about up, up, right, right, right. And it seems like, because of the stochastic transitions, you might not be in the same state. Like, you don't know what state you're in when you take those actions.
No. So one of the things about what we're talking about here, for the Markov Decision Process, is: there are states, there are actions, there are rewards. You always know what state you're in, and you know what reward you received.
So does that mean you can't do up, up, right, right, right?
Well, the way it would work in a Markov Decision Process, so, what you're describing is what's often called a plan. You know, it's: tell me what sequence of actions I should take from here. What a Markov Decision Process does, and what a policy does, is it doesn't tell you what sequence of actions to take from a particular state. It tells you what action to take in a particular state. You will then end up in another state because of the transition model, the transition function. And then, when you're in that state, you ask the policy: what action should I take now? Okay.
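The plan-versus-policy distinction is easy to see in code; a sketch, reusing noisy_move() from earlier and assuming the policy has an entry for every state it can reach:

```python
def run_episode(policy, start=START):
    """Follow a policy until an absorbing state is reached."""
    state, trajectory = start, [start]
    while state not in (GOAL, PIT):
        state = noisy_move(state, policy[state])   # ask the policy every step
        trajectory.append(state)
    return trajectory
```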
Right, so this is actually a key point. Although we talked about it in the language of planning, which is very common; for example, take any AI course, and they think about this in terms of planning: what are the things that I can do to accomplish my goals? The Markov Decision Process way of thinking about it, the typical reinforcement learning way of thinking about it, really doesn't talk about plans directly, but instead talks about policies, from which you can infer a plan. But this has the advantage that it tells you what to do everywhere, and it's robust to the underlying stochasticity of the world.
So, is it clear that that's all you need to be able to behave well?
Well, it's certainly the case that if you have a policy and that policy is optimal, it does tell you what to do, no matter what situation you're in. And so, if you have that, then that's definitely all you need to behave well.
But, I mean, could it be that you wanted to do something like up, up, right, right, right, which you can't write down as a policy?
And why can't you write that down as a policy?
Because the policies are only telling you what action to do as a function of the state, not sort of how far along you are in the sequence.
Right, unless, of course, you fold that into your state somehow. But that's exactly right. The way to think about this is: the idea of coming up with a concrete plan of what to do for the next 20 time steps is different from the problem of, whatever step I happen to be in, whatever state I happen to be in, what's the next best thing I can do? And just always asking that question.
If you always ask that question,
that will induce a sequence, but that sequence
is actually dependent upon the set of states
that you see. Whereas in the other case, where we wrote down a particular plan, you'll notice that was only dependent upon the state you started in, and it had to ignore the states that you saw along the way.
And the only way to fix that would be
to say, well, after I've taken an action, let me look at the state
I'm in and see if I should do something different with it. But if
you're going to do that, then why are you trying to compute the complete set
of states? Or I'm sorry, the complete set of actions that you might take.
Okay, so there you go. Now, a lot of what we're going to be talking about next, Michael, is: given that we have an MDP, this Markov Decision Process defined like this, how do we go from this problem definition to finding a good policy, and in particular, finding the optimal policy? That makes sense.
Good. And there you go.
More About Rewards
Okay, so, I want to talk about two things on this
little slide, here. The first one is kind of a general
observation about rewards and what makes the reinforcement learning MDP problem
different from the supervised learning problems that we did before. And
that's the notion of not just rewards, but this notion
of delayed rewards. So, what do I mean by that? Well, if you think about the way we've set up the kind of problem we have here, where we're trying to start in this little bottom left-hand square and wind our way up into this plus one, there's really this notion of sequences of actions. So in the
reinforcement learning context, and all the things that we're talking
about, at least in the foreseeable future, there's this
sort of problem where you take some action and
that puts you in some place and then you
take another action and that puts you in some place.
And then you take another action and that puts you in some place
and maybe it puts you at a place where you get plus one.
Or maybe it puts you at a place where you get minus one.
And what really is going on here is this idea that you take
actions that will set you up for other actions that will set you
up for other actions. And only then do you know how good those
particular actions you took were. So this reward is not just an idea
of getting a reward at every state, it's an idea of getting delayed reward.
So you don't know how your immediate action
is going to lead to things down the road.
So, let me give you a concrete example
about that. So, have you ever played chess?
So let's say we played a long game
of chess, and maybe it took 60, 61, 62 moves. And
then at the end, I win the game. So, what
do you know about the way you played that game?
That I probably made a bad decision around
the time when I decided to play chess against you.
That's possible, but maybe you made a
good decision and you kept making good decisions, and
you only messed up at the very last move,
when you could have mated me but you didn't.
I see. Well, my experience in playing chess is that usually, that's
not what happens. Usually it's not that I make a bad move and then there's
a problem. It's usually I make a move that I think is reasonable that only
later I discover put me in a position where I can't possibly do anything good.
Right. Well, actually, a game that I played in New York many, many years ago that I lost was exactly like that. The game went on for literally like a hundred moves, but I really lost the game on the third move. I made a mistake; I transposed two moves, because it was a new opening that I was just learning, and I knew at the time that I screwed up. I played beautiful chess from that point on, but the truth is the other player had a positional advantage from that point on that I could never overcome. So I lost the game, but not because I played poorly for, you know, 80 moves. It's because I played poorly for one move, and that move happened to be fairly early. So this is this notion of delayed reward: I played this long game of chess, and maybe I played well and screwed up at the end, maybe I played a, you know, mediocre game but had a couple of brilliant moves and that's why I won, or maybe I played very well in the beginning, poorly at the end, or
the other way around. And the truth is, you don't really know. All you know is that you're taking a bunch of actions, you get reward signals back from the environment, like I won the game or I lost the game, and then you have this problem of figuring out: of all the actions that I took, what was the action that led to me ultimately winning or losing, or getting whatever reward I got at the end of the sequence? Does that make sense?
Yeah, it seems like that could be really challenging. You probably need your own, like, sportscaster listening and commenting.
Yeah, and in fact the sportscaster who is listening and commenting, at least in the MDP world, is in fact the sequence of rewards that you get and the sequence of states that you see, right? It really is sort of a play-by-play: in this state, this action got this reward. And you want to take all of that and figure out whether your first action was a good action or a poor action, or your second action was good or poor, and so
on and so forth. Now contrast that with supervised learning. In supervised learning, what you would be getting is: well, I was in this state, this first state, and this was the proper action I was supposed to take, let's say action seventeen. And your goal then is simply to learn a function from states to specific actions; that's how the supervised learning problem is set up, right?
In this particular case, you're in some state, you take some action, and you get some reward for the action you took, or maybe for the state that you ended up in, or something. And you get a sequence of these state-action-reward triples, and ultimately you have to figure out, for the given states you were in, what were the actions you took that helped to determine the ultimate sequence of rewards that you saw, perhaps this plus one or this minus one that you got at the end. This is a really difficult problem; it's got its own name: it's called the credit assignment problem. In fact, since we're talking about a sequence of events over time, we typically refer to this type of problem as the temporal credit assignment problem. Does that make sense, Michael?
Yeah, that's really cool. Although I guess it's credit but also blame, right?
Well, credit and blame, you know, blame is just the minus of credit. So, without loss of generality, assume that credit can take a negative or positive value.
And yet when I go to the supermarket they never say blame or credit.
Well that's just because they don't understand
the technical terms.
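To summarize the contrast Michael and Charles are drawing, here is the shape of the data in each setting; the numbers are illustrative only (the -0.04 per step anticipates the example coming up next):

```python
# Supervised learning: (state, correct action) pairs, a teacher's labels.
supervised_data = [((1, 1), 'up'), ((1, 2), 'up'), ((1, 3), 'right')]

# Reinforcement learning: (state, action, reward) triples along a trajectory.
rl_trajectory = [((1, 1), 'up', -0.04),
                 ((1, 2), 'up', -0.04),
                 ((1, 3), 'right', -0.04)]   # ...until a +1 or -1 at the end

# Temporal credit assignment: from trajectories like the one above, decide
# which earlier actions deserve the credit (or blame, its negative) for
# the reward that only shows up at the end.
```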
More About Rewards Two
We're really interested in solving this temporal credit assignment problem; as we will see as we talk later on in the course, this sort of issue comes up again and again and again. But I didn't really want to talk about that in any great detail; I just wanted to make certain that people saw the difference between supervised learning and the sort of problems we're trying to solve now, which really have to do with sequences and time. What I want to do now is focus a little bit more on rewards. In particular, I want to look at the little grid world that we've been playing with so far, and think about how we would learn how to get from our start state, we always start over here, over to one of the plus 1 or the minus 1, depending upon the kind of rewards that we see. So, let's push on this rewards notion for a minute, if we think about how MDPs are made up.
And I'm just going to write down for you what the reward is for the states in this particular grid world, okay? So, let's say I'm going to set the rewards to be equal to minus 0.04. The first thing I want to do is make certain you understand what I mean by this. What I mean when I write this down, for this particular example, is that the reward that I'm going to receive in every single state is minus 0.04, except for the two states I've already labelled as plus 1 and minus 1. Okay?
It seems like you should just never ever move then, right?
No, I don't think that's true. Let's think
about that for a little while. So, if I
set this reward to be minus 0.04 and you're
starting out here in this state, and you have
to keep living, you always have to take an
action, right? That's the way I've set this up.
And you don't get to stop until you reach
either plus 1 or negative 1, we'll call these
terminating or absorbing states. Once you reach
into one of these things that ends
the game. What would having a small
negative reward like, this encourage you to do?
It's making me think about kind of walking across a hot beach, because, like, each time you take a step you're going to get a little bit of punishment, and the only thing that ends it is if you can get into that nice cool ocean. And then you get a plus 1 for that, and the minus 0.04s stop.
Right, so, wait, how would you put that another way? Sounds to me like what you're saying is, by having a small negative reward everywhere, it encourages you to end the game.
Yeah, that's [UNKNOWN]. Yeah, I agree with that.
Okay, yeah, and I think that's exactly right; I like that analogy a lot. You're on a, well, it's not a very hot beach, it's just, you know, a slightly unpleasantly warm beach, and you really want to hurry up and get into the ocean. In fact, you want to hurry up and get into the ocean even though, on your way to getting to the ocean, you might step on some glass; that's what the minus 1 is. So, let's go back to this notion of a policy mapping states to actions, and let's think about being in this world where you have a reward of minus 0.04 everywhere except in the absorbing states. What do you think the best set of actions is to take in these different states?
What do you think the best policy is? Actually before you tell me,
let me write down what the actual policy is. I'm not going to tell you how we get there, but I'm about to write down the policy that's sort of the best one to take in a world where you have this set of rewards. Okay?
Okay, you see this, Michael?
Yeah, and it makes a lot of sense. Like, you're basically heading towards the rewarding state, the green state.
Right, so if I start out here in the leftmost bottom state, basically,
it says, go up and then go to the right, which, coincidentally, is the policy
that we had found before.
Cool. There's a couple spots that are a little bit strange, though.
Well, the one directly under the minus 1, that one makes sense to me, but the other one takes you to a position where it seems like you want to go straight up to the goal, but yet it's going the long way around. Did it just not see the goal?
No, that's not it at all. Because of the way we're going to learn this, this is a global policy, and I'm just telling you it's the optimal policy given the rewards. So let's kind of work out why going to the left makes sense here. Well, so here's my argument. What this basically says is: take the long way around. Yes?
But, by taking the long way around, what's happening? Well, on
the downside, I'm going to pick up a little bit of negative reward for a while, you know, one, two, three, four, five, six steps or so, which is something like negative 0.24. But by doing this, I avoid ever getting into a state where I might fall into the negative 1.
Because it's stochastic.
Right, exactly. So stochastic means if I'm in this state here, no matter what I do, if I go up I have some chance of moving to the right and falling into the negative 1. It's a relatively low chance, it's only 10%. But it works out that, moving from here to here, the probability of me falling into here is too high compared to the sort of near impossibility of me ending up falling into the minus 1 if I follow this path.
Interesting, okay, I guess that makes sense, it's cool.
Which actually suggests sort of the point that I want to make here, which is that minor changes to your reward function actually matter. So if we had a slightly different reward here, say not minus 0.04 but something else, you might find that some of these decisions would be different. I think you can see that?
Hm. I mean, at the level that I'm understanding it, it seems like if it's zero, then it's in no particular hurry. If it's minus something big, then maybe it's in a bigger hurry. But I don't, yeah, I don't see exactly what the difference is going to be.
Okay, so let's see if we can help you see the difference by
making you take a quiz.
Okay, Michael, so here's the quiz. Are you ready?
Okay, so here's what's going on. I have copied over our little grid world
over here twice. So it's the same grid world except I have changed the rewards
that you will receive in all of the states other than the two absorbing states.
So for the top one, you get a reward of plus two in every single state except for the absorbing states, and in the bottom one you receive a
So, what I'm going to want you to do is, I'm going to want
you to fill in for these boxes that I've indicated, these same four boxes in
the top and the bottom, whether you should go up, down, left or right. So
enter into each one of them either U up, D down, R right, or L left.
Yeah, I appreciate you making the numbers nice and big, because I was worried you were going to say something like minus 0.13. But this, this I think I can do.
No, no, no, that wasn't a suggestion. Yeah, let's go with the plus two and
minus two. I feel like there's a better chance I might be able to get it.
[LAUGH] Okay, so you think you got what we're trying to do here?
I do. Now, just to make sure that I'm there: the plus one and the minus one end the game, and there's no more reward after that; it's like zero forever. And the other things, the blueish rewards, I would get them continually until a state is reached where the game ends. Okay? Yeah. No, that's cool.
Okay, cool. Alright so go.
Okay, Mikey, you got answers for me?
Wait, do I need to make any assumptions, or can I make any assumptions I want about the non-blue states?
No. It doesn't matter. [LAUGH] Actually. [LAUGH]
So the answer is no.
Maybe it does.
The answer is no. But it
doesn't actually matter because Remember, it's all
very stationary and Markovian. So the only thing that matters is where you are.
You're a stationary Markovian.
Alright, so I think, okay. So let's do the first one, then.
So, this is pretty cool. So by changing the reward the way that you did, it's not a hot beach anymore. It's like, you know, a super awesome money beach.
Yes, it's like a beach made of bacon.
Sure. Okay. Let's think of it that way.
Although neither of us is eating bacon at the moment.
At the moment.
Great, because then our mouths would be full and it would be very difficult to give this lecture. Alright, so, no, no, no, no, no, I mean, like, even.
Alright, never mind. The point is that it's awesome, and there's like
diamonds. I don't know, I just saw The Hobbit movie, and there
is a room filled entirely with treasures. And so, this is
kind of like that. And so now the question is, what should
I be doing, like I should never go to the plus 1
or the minus 1, because that ends my treasure gathering. I should
just continue to stay in states that aren't those states.
So I would say left for the top state.
Mm-hm.
Let's say left for the bottom state.
Yes, uh-huh.
And the other two I need to think about for a moment. So I feel like I'd like to get away from the minus 1, so, like, you know, to go up. But then that would give me some chance of slipping, wouldn't it?
Mm-hm. Let's see. I don't want to go to the right. I don't want to go down. I'm going to go to the left.
I'm going to bash my head against the wall, because then I get money.
That's weird. There's really no pain for running into the wall?
No. You just stay where you are.
Alright and for the bottom one, I feel
like it's the same kind of thing. I don't
want to go to the right or to the left,
cause then there's a probability that I'll slip and
end up ending my game. So I'm going to go down.
Yeah, I like that. So most of these are right, and one of them is wrong. Or, it's not that it's wrong, it's just that it's not as right as it could be. Which one do you think that is?
I think they're right. But if there's one that I could imagine you not being so happy with, it's the bottom one.
There are two bottom ones. You mean this one?
Yeah. What would I like?
You might prefer if I bash my head against the wall.
No, actually, it turns out that this is a correct answer, so all four of these are correct. But in this particular state, it actually doesn't matter which way you go.
Oh, I see.
A more complete answer would be: it doesn't matter.
And is that true of any of the other ones? No. I guess,
I guess you're right. The other ones, you have to go in the direction indicated.
Right. So, by the way, just like here, where it doesn't matter which direction you go in, it's actually true for all of the other states as well. It doesn't matter where you go, because these three states are the only states where you could accidentally end the game, and in each of them you can take an action that guarantees that you don't end the game. So it really doesn't matter what you do in the other states.
Right? So that's a good argument. This makes a lot of sense. So at the end of the day, basically, because the reward is positive, and you're just accumulating reward, or diamonds and treasure, as you put it, you just never want to leave. So you always want to avoid either of these states. It doesn't matter that there's a +1 here and a -1 here. If I can stay here forever, then I just accumulate positive reward forever. Okay, so what's the next one?
So, interesting. So now the beach is really, really hot. It's minus 2, so.
Mm-hm.
I want to get off here as quickly as I can. So definitely for the top one, I would go to the right. Just try to get that +1 and get outta here.
Alright, so then, let's think about the bottom right.
So, in the best of all possible worlds, I go around and bump into the +1. And that would take me, again, if I don't slip at all, one, two, three, four steps to do that.
Actually, wait: one, two, three, three steps on the hot sand.
And I'm getting a -2 for each of those. So that's like minus 6. So I could get that plus 1 at the end, and that makes it only minus 5. But I think I'd rather just get off the beach. So I'm going to say, in the bottom right, to actually dive into the pit of pain, because it can't be as bad as where I am.
That's exactly right.
So, okay, so then the one next to it on the left is a little trickier. If I go up to the plus 1, I get a -2 followed by a +1, which is also -1.
So okay. I'm not so sure about that one.
Well, remember, you'll always have some probability of ending up back where you started.
Is that true? Oh, because, if I. Well, no, that's not true if I dive straight into the -1. Then I have a 0.8 chance of going to the -1, 0.2 of going up to the arrow that's labeled,
Wait, 0.2? No, 0.1.
Whereas if I go up, right, I might stay where I am and accrue -2. Like, yeah, I feel like I need a calculator for this.
Yeah, but I think your intuition is right, though. Just think about it as: what do you think has the best chance of most quickly ending things?
Well certainly jumping
into the -1.
I think that's going to be marginally better, because there's only one chance of slippage and delay. Whereas if I take an additional step, there are two chances of slippage and delay.
Exactly, that's exactly the right kind of argument to make. And because the -2 is so big, you're just better off ending it, even if you end in pain, in the same way that you were in the bottom right-hand corner.
And so bottom left I would say the cleanest thing
would be to jump into the -1. So to go to the
right so that I can go up and get into the -1.
And what's nice about that move is that you either end up in the bottom right-hand corner, or you end up staying where you are, or you end up moving up. But in any case, you never get farther away, right? Whereas if you try to go up, there's a chance that you will get farther away.
Interesting. Is there a way to be sure about these?
Yeah, you just have to do the math.
Do we know how to do the math?
Yeah, you just do expectations. You can basically just say, well, what's the expected time? Since what you're trying to do here is get to the end, you can calculate the expected length of time it'll take you to get there, and then just, you know, multiply by all the rewards you're going to get in the meantime. If it helps, though, I will tell you that for any value where the reward is less than minus 1.6284, this is the right thing to do. Just so you know.
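Doing that math could look like this rough Monte Carlo sketch, reusing run_episode() from earlier; it assumes one per-step reward for each non-terminal state visited plus the terminal +1 or -1, and note that the lecture's exact -1.6284 threshold comes from working the expectations out analytically rather than by sampling:

```python
def estimate_return(policy, step_reward, episodes=100_000):
    """Monte Carlo estimate of a policy's expected total reward."""
    total = 0.0
    for _ in range(episodes):
        traj = run_episode(policy)
        total += step_reward * (len(traj) - 1)        # non-terminal states visited
        total += 1.0 if traj[-1] == GOAL else -1.0    # terminal payoff
    return total / episodes
```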
Alright, so you're saying you actually do know that this is the optimum, the thing that we have here.
Yes, I know that this is the optimum.
Alright. Let me just draw out the rest of it for you. So I think it's interesting to compare this one, the bottom right one, where you have given me negative rewards to encourage me to end the game, with our main one up here in the upper left, where you have also given me some negative reward to encourage me to end the game. If you look carefully, you'll notice that in this one, you're encouraged to end the game, but you're really encouraged to end the game at a plus 1. Here, the reward is so strongly negative, you just need to end the game. And so you end up with a different response here, because it's just as quick to get to the minus 1 and into pain as to take this extra step and try to get to the plus 1. And you end up with a different response here, here, here, and here.
That's really interesting.
Yeah, it is. And so, these changes matter. They matter a great deal.
So, how am I suppose to choose my rewards?
Carefully and with some forethought.
But, you know, again, there are lots of ways to think about this. The way we've been talking about MDPs is kind of the way we talked in the first part of the course, when we talked about having teachers and learners. You can think of your reward as sort of the teaching signal, the teacher telling you what you ought to do and what you ought not to do. But another way of thinking about this is that, because the rewards define the MDP, the rewards are our domain knowledge. So if you're going to design an MDP to capture some world, then you want to think carefully about how you set the rewards in order to get the behavior that you wish.
That's fair. I mean, it seems a little bit like
a cop out, but I think it seems like a necessary one.
Yeah. I mean, again, no matter what you do, you've got to be able to inject domain knowledge somehow. Otherwise, there's no learning to do. And in this case, the reward basically is telling you how important it is to get to the end.
Okay. Alright. Let's move on.
Sequences of Rewards One
Alright. So having gone through that exercise, Michael, I think it's worthwhile to step back a little bit and think about the assumptions that we've been making that have been mostly unspoken. And I'm going to say that the main assumption we've been making in some sense boils down to a single word, and that word is stationary. So let me tell you what I mean by that, by kind of illustrating what it is we've been doing for a little while. Okay?
Okay. So the first thing I'm going to say is that we've actually been kind of assuming infinite horizons. So what do I mean by that? When we think about the last grid world that we were playing with, we basically said, well, you know, I want to avoid going to the end as quickly as possible if I have rewards of a certain value, or whatever, because the game doesn't end until I get to an absorbing state. Well, that sort of implies that you basically can live forever, that you have an infinite time horizon to work with. Now, can you imagine why, if you didn't have an infinite time horizon to work with, you might end up doing something very different?
Different from what we're doing in the grid world?
Right, so here, let me show you the grid world that we were doing before; it might help you think about it. So here's the grid world we had before, and as you recall, we had a particular policy that sort of made sense. Here, I'll write it out for you again. And this was the case where we had a reward of minus 0.04. Remember? We just did this.
Okay, and this was the policy that turned out to be optimal, and the feature I want you to pay attention to here is that when you're over here, near a possible end state, rather than going up, it made sense to take the long way around, because you're going to get some negative reward, but it's a small enough negative reward compared to where you might end up, okay, with the positive one.
Yeah, I see.
And that makes some sense. Well, that only makes sense if you're
going to be living long enough that you can take the long route around.
What if I told you you only had, say, three time steps left, and then the game is going to end no matter where you end up?
Well, it might make more sense to take some risk and just try to take the short way, because otherwise there's really no chance you're going to get to the plus 1. Though I'm not entirely convinced of that, because there's still a chance you'll fall into the minus 1 along the way.
Right, so whether it makes sense to take the risk or not is going to depend upon two things. We've already talked about one of them, which is the actual reward that you get: if this reward were, you know, negative enough, then clearly it makes sense to just try to end things quickly, right? We just showed that in the last quiz. But another thing it's going to depend upon is how much time you have in order to get to where you're going. If you've only got one or two time steps before everything ends, you can imagine that there are cases where, without changing the reward too much, it makes a lot of sense to go ahead and quickly try to get to this plus 1, even though you have some chance of falling into the minus 1, as opposed to trying to move away, where you're then all but guaranteed that you're never going to reach the plus 1. So whether it makes sense to take the risk or not will depend upon the reward, but it's also going to depend upon whether you have an infinite amount of time to get to where you want to go, or a finite amount of time. And the real major thing I want you to get out of that is that if you don't have an infinite horizon but a finite horizon, then two things happen. One is the policy might change, because things might end. But secondly, and more importantly, the policy can change, or will change, even though you're in the same state. So, if you're in this state right here, and I told you you didn't have an infinite amount of time, but you still had 100 million time steps, then I think it's clear that it still makes sense to go the long way around, right?
Yeah,
I mean, the probability that this game is going to last for 100 million time steps has got to be tiny.
Right. So 100 million time steps might as well be infinity. But if I make that number not 100 million but 2, or 3, or 4, then suddenly your calculus might change. In particular, your calculus will change even though I'm in the same state. Right? So maybe in this state right here, if I've got 100 million time steps I still want to go the long way around, but if I've only got a few time steps, the only way I'm ever going to get a positive reward is to go this way. Does that make sense?
I guess so. So, you're saying, for example, even within a single run, it could be that I'm in a state and I try an action, and maybe it doesn't work and I stay where I am. And I try it again, and maybe it doesn't work and I stay where I am. I might then switch to a different action, not because the other one wasn't working, but because now I'm running out of time.
Right, exactly. So we talked about this notion of a policy, which maps states to actions, and we talked about this notion of stationarity. You believed that this sort of Markovian property said it doesn't matter where I've been, it only matters where I am. And so if I'm in this state, since it only matters where I am, I'm always going to want to take the same action. Well, that's only true in this kind of infinite horizon case. If you're in a finite horizon case, and that finite horizon, of course, is going to keep counting down every time you take a step, well then suddenly, depending upon the time steps that are left, you might take a different action.
So we could write that, I think, just for the sake of seeing it, as something like: your policy is a function of both the state and the time step you're in.
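In symbols, the distinction being drawn is between a stationary policy $\pi(s)$, where the action depends only on the state, and a non-stationary policy $\pi(s, t)$, where the action can also depend on how many time steps remain.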
Hm. And that might lead you to a different set of actions. So this is important. We are not, for this course, going to talk about this case at all, where you're in a finite horizon, but I think it's important for you to understand that without this infinite horizon assumption, you lose this notion of stationarity in your policies. Okay?
Okay. So all that, I think, is making explicit something that isn't obvious, but becomes obvious after someone points it out to you. So, the second thing that I want to talk about is, I think, a little bit more subtle, and that's this notion of utility of sequences. As we've been talking, Michael, we have been implicitly discussing not just the rewards we get in a single state, but the rewards that we get through a sequence of states that we visit. And so I just want to point out a little fact that comes from that, and where that ends up leading us. And then we'll get to some nice, cute little series of math. So,
here's what I want to point out about what we mean by the utility of sequences. It means we have some function, which I'm going to call U for utility, over the sequence of states that we're going to see; let's call them S0, S1, S2, and so on and so forth. Well, I think an assumption that we've been making, even if we haven't been very explicit about it, is that if we had two sequences of states, S0, S1, S2, dot dot dot, and a different sequence S0, then S1 prime and S2 prime, that is, two sequences that might differ from S1 on but both start in the same start state, okay? If we have a utility for the first, and that utility happens to be greater than the utility for the second, then it also turns out that we believe that the utility for S1, S2, dot dot dot, is greater than the utility for S1 prime, S2 prime, dot dot dot.
So these are two different sequences; the S's and the S primes are two different sequences. And in the beginning we're comparing them with S0 stuck in front of both of them. And we're saying, if I prefer S0 followed by all the S's to S0 followed by the S primes, then I have that same preference even with those S0s missing.
Right, and so this is called stationarity of preferences. And another way of saying it is that if I prefer one sequence of states today over another sequence of states, then I'd prefer the first sequence over the other one tomorrow as well.
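Written out, the assumption just described is:

$$U(s_0, s_1, s_2, \ldots) > U(s_0, s_1', s_2', \ldots) \implies U(s_1, s_2, \ldots) > U(s_1', s_2', \ldots)$$

that is, prepending the same state $s_0$ to two sequences never changes which one is preferred.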
So isn't this just obvious? Because whatever the rewards for those two cases are, we're just adding the reward we get for S0. So it's going to be the same.
But listen to what
you just said. You just said, well, it'll be the same, because all we're doing
is adding the reward for S0. But what did we ever say about adding up rewards?
I thought, I thought that's what we were doing.
That's right, that is what we were doing. But we never actually sat down and wrote it down and said what it means to talk about the utility of a sequence of states, as opposed to the reward that you get in one state.
Okay, so you're saying that if we are adding rewards, then this follows.
And I've actually been saying something even stronger, which I will show you on the next slide: if you believe that this is true, that the utility of one sequence of states is greater than the utility of another sequence of states both today and tomorrow, then it actually forces you to do some variation of what you said, which is adding the rewards of the sequence of states that we see.
That's really interesting. So then the adding isn't really an arbitrary thing; it follows from this deeper assumption.
Right, and the reason I bring this up is because if you were to just grab someone off the street and start talking about Markov Decision Processes, one of two things would happen. Either they'd run screaming from you like you're a crazy person, or they would sit and listen, and if they listened they would just completely buy into the idea that you add up sequences of rewards, the rewards that you see, as a way of talking about how good the states are, because that's a very natural thing to do. But it turns out that, mathematically, if you have this notion of stationarity of preferences in this sort of infinite horizon world, you really are in a case where this has to be true, and it has to be the case that you do some form of addition, because nothing else can be guaranteed to maintain this property of stationary preferences. I mean, as you said, if I've got one sequence of states and another sequence of states, and I compare them by prepending or appending another set of states to both, I still want to always guarantee that one's greater than the other. You kind of have to do some form of adding the rewards that you see in the states in both cases, because if you don't do that, then eventually this inequality will not hold. So, let me write that down in math terms and see where that gets us, okay?
Sequences of Rewards Two
Alright, Michael. So, let's write that
down in math, okay? You like math, right?
Okay. So here's the math version of it. I'm going to
just say that the utility that we receive for visiting a
sequence of states, S0, S1, S2 ellipsis, is simply the sum
of all the rewards that we will receive for visiting those states.
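In symbols, the definition being proposed is:

$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} R(s_t)$$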
Does that make sense?
Is that consistent with what you were
doing when you were thinking about the grid world?
Yeah, exactly. But I thought we were going to make that not
definitional, we were going to derive that it had to be that way.
No, we're not. They can read about it. It'd
take me, like, an hour and a half to do it.
And I'm already losing my voice. But what I do want you to believe is that the utility of the sequence of states, thinking of it as the sum of all the rewards, makes sense, right? And if nothing else, at least it's consistent with what you've been doing as we've been talking about this grid world.
Yeah, absolutely. I mean it, it also kind of makes me think about money.
Well, so, so, I mean, that's sort of how money works. If we get some kind
of payoff each day, those payoffs get added
to each other. They don't get, you know, subtracted
or multiplied or square rooted or whatever. They
just, you know, they go into your bank account,
and things add. So this feels like, kind
of like that. It's like money in my pocket.
Sure, if you have a big enough pocket.
Okay, I'm with you on that. Alright, so I'm going to say that this all makes sense. It's really awesome, and it sort of doesn't work. And to illustrate why it doesn't work, I'm going to give you a quiz,
because you like quizzes.
So I've been told.
Sequences of Rewards Quiz
So, here's the quiz. You see on the screen these two sort of unexplained little squiggles with a bunch of numbers in front of them, on top of them, or under them, or near them?
Yeah, I assume that's a river, and on one bank it's the land of the plus ones, and the other bank has been infiltrated by plus twos.
That's not what I intended at all, but I actually like that enough that I'm going to pretend that that's what I intended. So these work out to be sequences of states.
Oh, I see now.
And rewards that you receive. No, no, no, it's a riverbank, and on one side of the bank, as you walk along it, you get a bunch of plus ones. On the other side you get a bunch of plus ones and some plus twos. And this just goes on forever, okay? And I'm not going to tell you what all the rewards are after that, except that they look similar to the rewards that you see here: plus ones and plus twos, okay? And let's say the top one is, in fact, nothing but plus ones, and the bottom one is some plus ones with maybe some plus twos sprinkled in, but those are really the only options. Okay?
Here's my question to you. So, would you
rather be on the top side of the river
bank, or the bottom side of the river bank?
Okay? You think you know the answer already, don't you?
Okay, cool. Well then, let's, let's give our listeners
a, a chance to come up with the right answer.
Sequences of Rewards Quiz
Okay Michael, I didn't give you as much time as I normally
do because you think you already know the answer. So what's the answer?
So, you know, I'd rather get the
plus 2's occasionally, so I'll say the bottom one.
Wait, what? You're not going to tell me the top one's better?
No, it's not.
Okay, then that feels like a trick question.
It's not a trick question. The answer is, neither one is better than the other.
Oh, so I had to pick "neither, which is better"? And I can click...
Or you could click on both. Okay. So you thought that the bottom one was better; what was your reasoning for that?
Because, sort of moment to moment, sometimes I'm getting plus ones, and that matches what I would've gotten if I had been on the other bank. But then sometimes I get plus twos, which is actually better than what I would have gotten on the other bank. And so I never feel any regret being on the bottom bank. I only feel regret being on the top bank.
That's fine. So, what would you say
the utility of the sequence along the bottom actually is?
1 plus 1 plus 2 plus 1 plus 1 plus 2 plus dot, dot, dot.
Which is equal to?
infinity, I guess.
That's right. What about the utility of the top one?
1 plus 1 plus 1 plus, that's also infinity.
Yep. So the utility of both of those sequences is equal
to infinity. Do you think one of them is bigger than the other?
You drew the bottom one a little bit larger.
Or longer, anyway.
But, in the end, the sums of the rewards that you get are both going to be infinity. So the truth is, neither one of them is better than the other.
I still don't think you can say that both are better.
You can't get around that.
But yeah, I see, I see. I can see neither is better.
Or that both are better. Neither is better, both are better, whatever. The point is that they're both equal to infinity.
And the reason they're
equal to infinity is because all we're doing is accumulating
rewards. And, if we're always going to be able to get positive
rewards no matter what we do, then it doesn't matter
what we do. This is the existential dilemma of being immortal.
Oh, living forever.
So if you live forever then, like why should you care about anything ever?
Right, I mean, all the mortal people are going to die, and one day they'll all be, you know, an infinite amount of time in your past. I could do this thing here, which is pleasurable, or I could do this thing right now that is less pleasurable but will eventually get me to a better place. But if I'm going to live forever, and I can always get to a better place, then it really doesn't matter what I do. It really doesn't matter, because I'm just accumulating rewards, I'm living forever, and infinity is infinity; there's really no way to compare them.
Having said that, your original notion, that, look, it feels like I should never regret having taken the second path compared to the first because I will occasionally do better, seems like the right intuition to have.
I see, but it's just not built into this particular utility scheme.
Right, but it turns out there's a very easy way we can build it into this utility scheme by just making one tiny little change. Would you like to see it?
Beautiful, let's see it then.
Sequences of Rewards
Okay, Michael, so here's the little trick. All I've done is replace the equation at the top with an alternate version of the equation.
'Kay. So it looks like you're now exponentially blowing up the reward.
I am not exponentially blowing up the reward. What I've done is I've added up the rewards that I'm going to see, for the states that I'm going to see, and I multiplied each by gamma to the t, where gamma is between 0 and 1.
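So the equation now reads:

$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1$$

Setting every $\gamma^{t}$ to 1 would recover the plain sum from before.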
I see. So...
Well, kind of. So it doesn't exponentially blow up; it exponentially implodes. Things that are, like, a thousand steps in the future: if we take something that's less than 1 and we raise it to that power, it goes essentially to 0. So it's like the rewards start off kind of substantial, and then they trail off quickly.
Right. And in fact we can write down mathematically what this is. If you stare at it long enough and you remember your intermediate algebra, or your calculus, or wherever it is you learned this stuff, you would probably recognize this as a special kind of sequence or series. Do you remember what it is?
No, but that's remarkably close. So let's see if we can bound this particular equation. We know that the first version of the equation eventually ends up being infinity if all the rewards are positive, right? So what does this one end up being in the case where all the rewards are positive? Well, we can bound it from above by the largest reward that we will ever see, in the following way. All I've done is said, well, I don't know what the rewards are, but there is some maximum reward; I'm going to call it R max. And if I pretended that I always got the R max reward, then as long as that's an actual number, a finite number, I know that this is bounded from above by this expression. Well, what does this look like? Maybe this does look like a series you remember.
Yes. This is the geometric series. And this is exactly equal to this.
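The bounding argument, written out:

$$\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}$$

which is finite whenever $R_{\max}$ is a finite number and $\gamma < 1$.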
And I assume that the summation
causes the max to become lower case.
Yes, yes it does. That's a, that's a trick that
they don't really go over deeply in calculus, but it's true.
Oh, that's cool. Alright, so: the maximum reward divided by 1 minus gamma. So when gamma is really close to 0, you basically only get the reward at the beginning, and then everything falls off to nothing after that. And if gamma is really close to 1, you're dividing by something that's teeny tiny, which is like multiplying by something that's really big. So it kind of magnifies the reward, but it's not infinity until gamma gets all the way up to 1, and that's the same case we were in before.
Right, so in fact, you'll notice the way I wrote this: gamma has to be between 0 and 1, 0 inclusive but strictly less than 1. If I made it so that it could include 1, well, that's actually the first case, because 1 to a power is always 1. So this is actually a generalization of the sort of infinite sum of rewards. This is called discounted rewards, or discounted series, or discounted sums, and it allows us, it turns out, to go an infinite distance in finite time.
Wait.
That makes sense.
No. No. [LAUGH] An infinite distance?
Yeah. That's at least the way I like to think about it. So, if we discount in this particular way, then that gives us a geometric series. And what it does is allow us to add an infinite number of numbers but actually get a finite number. Cool.
And it turns out, by the way, just to be clear, that this world that we're in, where gamma is between 0 and 1, is still consistent with our stationarity of preferences. So we can add things together and be consistent with those assumptions we were making before, or we can do discounted rewards and we're still going to be consistent with what we were doing before. But the difference between this case and that case is that by doing the discount, we get to add infinite sequences and still get something finite. The intuition here is that, since gamma's less than 1, eventually as you raise it to a power it basically becomes 0. I mean, that's why it's a geometric series. Which is like having a finite horizon: but that horizon is always the same finite distance away, no matter where you are in time.
Which means it's effectively infinite.
Or unbounded, or however you want to think about it. So it's like you take a step, but you're no closer than when you started.
Right. Well, what you meant to say is that you're no closer to where you were trying to end up. So that means that even though you've taken a step forward, the horizon sort of remains a fixed distance away from where you are. That's what it means to have this kind of gamma value here, and so you can still treat it as if it's always going to be infinite, even though it gives you a finite value, and that gives you your stationarity. Does that make sense? You really are in an infinite world; it's always infinity. But the gamma value means that at any given point in time, you only really have to think about what amounts to a finite distance. You never get closer to the horizon, even though you're always taking steps towards it. But there still is a horizon that you can see. And now I can kill that analogy.
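One way to make the fixed-horizon intuition concrete (a standard rule of thumb, not something stated in the lecture): since $\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma}$, a discount of $\gamma = 0.9$ weights the future roughly like an effective horizon of $\frac{1}{1-0.9} = 10$ steps, and that effective horizon stays the same no matter which time step you are standing on.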
[LAUGH] So can you explain to me why it has this form, this R max over 1 minus gamma?
I could. You can kind of prove that in a little cute math way, and I'm happy to take that diversion if you want.
Yeah, I think so. I mean, unless,
unless you were going to tell me something else.
Well how about I tell you something else and then I take that diversion?
Okay, so here's the something else. In the beginning I said, well, this is like going an infinite amount of distance in finite time, which you rebelled against. And my explanation is more about how infinity looks finite, which is not the same thing as going an infinite amount of distance in finite time. The reason I said that is because of the singularity.
The singularity. So, some of our listeners no doubt have heard of the singularity. You've never heard of the singularity?
The singularity is, like, when computers get so fast that you can do infinite computation and... oh. Right.
So the singularity, for those of you who don't know, is this observation that has been made by many folks: the limit to computer power growing faster is the fact that it takes us some amount of time to design the next generation of computers. Right? So Moore's law says everything doubles every, you know, some fixed amount of time; if we could design things faster, then we could actually make it double more quickly. So one day we'll get to the point where the computer, which I will draw here, like that, can actually design the next generation of computer. Well, when it designs the next generation of computer, that computer will also have the ability to design the next generation of computer. But it will be able to design it twice as fast. And then the next one will be able to design its own successor twice as fast, and so on and so forth. So this little time between generations will keep halving every single time, which looks remarkably like a geometric sequence. And so once we reach this point, where the computer can actually design its successor, we will then be able to do an infinite number of successors in finite time.
And that's when the world comes to an end. And that's what's called the singularity, because we can't understand what happens after that point. So that's what I mean by going an infinite amount of distance in finite time.
It just doesn't seem like distance.
But, yeah, that's a weird example, for sure.
I think it makes sense.
[INAUDIBLE] infinite number of doublings in it. Yeah, [INAUDIBLE], no. [LAUGH] Well, we'll do some assignment where we allow people to decide whether this makes sense, or one of your analogies makes sense. And I'll just remind the listener that I'm the one who'll be assigning final grades.
So with that, let's go do your little diversion, and then we'll come back to doing a whole lot of math. Actually, I think this nice little diversion will be a warmup to all the math we're going to be doing. Okay?
Oooh. Hm. More math.
So, if we think about the equation that we were doing before, which was, you know, the sum of all these gammas times some R max, we can basically factor out the R max, and what we end up with is something that looks like that, right? You're summing together a bunch of gammas and then multiplying the result by R max. So what does that actually look like? Well, it looks like this: gamma to the zero, plus gamma to the one, plus gamma squared, plus dot dot dot, eventually multiplied by R max.
Right? So let's call this x, okay?
Does x include the R max or not?
No. We'll just call the little sequence... it's going to turn out not to matter. Let's just call the sequence of gammas we're adding together, let's just call that x.
Well, if you look at it, this is actually recursive, right? Because this is an infinite sequence, if you shifted it one over in time, you would end up with just the same sequence again. Alright? Which means that we can write x in terms of itself: x equals gamma to the zero plus gamma times x.
I see, so it's shifted over, but then you have to multiply it by gamma again to get back up to gamma to the one, gamma squared.
Right, exactly. So this is just math; that's all it is. So we can try to solve for x and figure out what x is, right? So, what is x? Well, we can subtract from both sides, and we end up with something like x minus gamma x equals gamma to the zero. Right?
Like, we could stop writing gamma to the zero. Isn't that just one?
Yes, but I've already gone too far, Michael. I've already written gamma to the zero.
So, this becomes x times 1 minus gamma equals gamma to the zero, or, as Michael so astutely points out, 1. [LAUGH] Alright. Which means what? It means that...
So then we divide by 1 minus gamma.
x is 1 over 1 minus gamma.
And of course, we are going to multiply by R max.
To get the formula that we had. That's, yeah, that's nifty.
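The whole derivation in one place:

$$x = \gamma^{0} + \gamma^{1} + \gamma^{2} + \cdots = 1 + \gamma x \;\Rightarrow\; x(1-\gamma) = 1 \;\Rightarrow\; x = \frac{1}{1-\gamma}$$

and multiplying through by $R_{\max}$ recovers the bound $\frac{R_{\max}}{1-\gamma}$.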
Yeah, there we go. So, why do we do this? Well, because you like
stuff like this and it points out something
very simple, which is that geometry is easy.
[LAUGH] I don't think that's the
kind of geometry that people usually think of.
Well, they should be thinking of this kind of geometry.
So geometry is easy. And by doing a little cute algebra, we can derive ridiculous equations that turn out to help us go an infinite distance in finite time.
I don't think that's what they're doing, but okay.
[LAUGH] So, let's think of this as warmup for what I actually wanted to
show you, which is going to turn out to be a whole lot of math. Okay?
Alright, let's, let's go for it.
Alright, let's do that.
Okay, so Michael, in the spirit of what we just went through in
deriving the geometric series, I'm now going to write down a bunch of
math. And what I'm going to do is I'm just going to say it
at you, and you're going to interrupt me if it doesn't make sense. Okay?
That makes sense.
It does. Okay, so here's the first thing I'm going to write down. I've decided that by going through this exercise of utilities and this kind of reward, we can now write down what the optimal policy is. The optimal policy, which as you recall is simply pi star, is simply the one that maximizes our long-term expected reward. Which looks like what? Well, it looks like this. There, does that make sense?
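In the notation being used here, the expression is:

$$\pi^{*} = \operatorname*{argmax}_{\pi}\; \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\right|\, \pi\right]$$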
Let me think, so... We have an expected value of the sum of the discounted rewards at time t, given pi. Meaning that we're going to be following pi?
Mm-hm. So these are the sequence of states we're going to see in a world where we follow pi.
And it's an expectation because things are non-deterministic, or may be non-deterministic. And do we know which state we start in?
It doesn't matter; it's whatever s zero is.
I see. Whatever s zero is. But isn't that random? I mean, s one and s two, s three, those are all random.
Well, we start at some state, it doesn't matter, so t is starting out at zero and going to infinity. Okay? So does this make sense?
Yes, so then, so we're saying, I would
like to know the policy that maximizes the value of
that expression. So it gives us the highest expected
reward. Yeah, that's the kind of policy I would want.
Exactly. So, good, we're done; we know what the optimal policy is. Except that it's not really clear what to do with this. All we've really done is written down what it was we were trying to solve. But it turns out that we've defined utility in such a way that it's going to help us solve this. So let me write that down as well. I'm going to say that the utility of a particular state, okay, is going to depend upon the policy that we're following. So I'm going to rewrite the utility so that it takes the superscript pi. And that's simply going to be the expectation over the sequence of states that I'm going to see from that point on, given that I've followed the policy. There, does that make sense?
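That is:

$$U^{\pi}(s) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\right|\, \pi, s_0 = s\right]$$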
It feels like the same thing. I guess the difference now is that you're saying the utility of the policy at a state is what happens if we start running from that state.
Yep. And we follow that policy.
Right. So this answers the question you asked me before about, well, what's S0? We talk about that in terms of the utility of the state. So how good is it to be in some state? Well, it's exactly as good to be in that state as what we will expect to see from that point on, given that we're following a specific policy having started in that state.
Does that make sense?
A very important point here, Michael, is that the reward for entering a state is not the same thing as the utility for that state. Right? In particular, what reward gives us is immediate gratification, or immediate feedback, okay? But utility gives us long term feedback. Does that make sense? So the reward of a state is the actual value that we get for being in that state, while the utility of a state is both the reward we get for that state and also all the reward that we're going to get from that point on.
I see. So yeah. That seems like a really important difference.
Like, if I say, here's a dollar. You know? Would you poke
the president of your university in the
eye? You'd be, like, okay. The immediate reward
for that is one. But the long term utility of that could be actually quite low.
Right. On the other hand, I say, well, why don't you go to college? And you say, but that's going to cost me $40,000. Or better yet, why don't you get a master's degree in computer science from Georgia Tech? But, you say, that's going to cost me $6,600. Yes, but at the end of it you will have a degree. And by the way, it turns out the average starting salary for people who are getting a master's or undergraduate degree is about $45,000.
So is it considered product placement if you plug your own product within the product itself?
No, I'm simply stating facts, Michael. That's all I'm doing. Just facts.
This is called fact placement.
The point is, there's an immediate negative reward of, say, $6,600 for going through a degree. Or maybe it's $10,000 by the time the 15th person sees this. But anyway, it's some cost. But presumably it's okay to go to college, or go to grad school, or whatever, because at the end of it you are going to get something positive out of it. So it's not just that utility prevents you from taking short term positive things if that is going to lead to long term negative things; it also allows you to take short term negatives if they will lead to long term positives.
That makes sense.
What this does is get us back to what I mentioned earlier, which is this notion of delayed reward. So we have this notion of reward, but utilities are really about accounting for all delayed rewards. And if you think about that, I think you can begin to see how, given that you have a mathematical expression for delayed rewards, you will be able to start dealing with the credit assignment problem.
Okay, so let's keep going and write more equations.
Okay, Michael, so I've erased the screen and kept Bellman's equation, the most important equation ever, and we are going to solve it.
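For reference, Bellman's equation in the form this lecture uses, with the reward attached to the state, is:

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')$$

where $T(s, a, s')$ is the probability of ending up in state $s'$ after taking action $a$ in state $s$.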
So, how are we going to make that work? Okay, so we wrote this down as a utility of s; how many s's are there?
s, n, m, I don't know?
Pick a letter.
N. So we have N states, which means this isn't really one equation. This is how many equations?
Yes, it's N equations. How many unknowns are there?
Well, we have U; the R's are known, the T's are known, so the only thing that's missing is the U's. And, oh, there are N of those as well.
Right, there are N equations in N unknowns.
So we're done.
Yes, because we know how to solve N equations in N unknowns, right?
If the equations were linear.
If the equations were linear. Are these equations linear?
I'm going to go with no, because the max is problematic.
That's right. This max operation makes it nonlinear.
So, it looked really good for a moment there. We've got N equations, N unknowns; we know how to solve that. But the max makes for a very, very weird nonlinearity, and there's kind of no nice way to deal with it. Actually, one day, Michael, if you ask me, there is a cute little aside you can do where you can turn max into something that's differentiable. It doesn't actually help us here, so I'm not going to go off on that aside yet.
But even differentiable wouldn't quite be linear.
That's right, and it wouldn't help us in this case. But the fact that you can have differentiable maxes is, I think, actually [UNKNOWN]. But that's also unimportant for what we're talking about now. So, we've got N equations and N unknowns, and they're nonlinear, which means we can't solve them the way we'd want to, by, like, inverting matrices the way people doing regression would normally do. But it turns out that we can, in fact, do something fairly similar, something that actually allows us to solve this even though it is nonlinear. And here's the algorithm you use, which sort of works, okay? It's really simple. Just start with some arbitrary utilities.
Declare those the answer and you're done.
That would be one way of doing it, but it turns out we can do even better.
Wait. Start off with the correct utilities?
That would work, except we don't know what they are.
So we're going to start with some arbitrary
utilities, but then we're going to update them based on
their neighbors. And then we'll simply lather, rinse, repeat.
Alright, so what does that mean based on neighbors?
So, it means, based on the states, you're going to update
the utility for a state based on all of the states that
it can reach. So, let me write down an equation that tells
you how to update them and then maybe it'll be clear, okay?
So we're going to have a bunch of utilities, since we're going to be looping through this. And let's just say every time we loop through, that's time t, okay? So we know what the equation for the utilities is; it's just the equation that's written up here. It's R of s plus gamma times the max over a of the expected utility, right? Except we have some estimate of the utility at time t, okay? And probably the right thing for me to do would be to write this as U hat, or something like that: this is my estimate of the utility. And so now, using this, I'm going to want to update all of my utilities to make them better. And it turns out I can do that by simply doing this. I'm going to update, at every iteration, my estimate of the utility of some state s by simply recalculating it to be the actual reward that I get for entering state s, plus the discounted utility that I expect, given the original estimates of my utility. Does that make any sense at all?
Yeah, I guess so. Okay, so we need the whole U hat t, like, at all states, because we're not just using the values at state s to update the values at state s. We're using all the values at the previous time step to update all the values at the current time step. So all these N equations, they're all tangled together.
Right, because of this expectation. So really, just to make it clear, this should be a summation over s prime.
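Putting the update together with that summation over s prime, the step being described is:

$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, \hat{U}_t(s')$$

And here is a minimal sketch of the whole loop. The containers `states`, `actions`, `R`, and `T` are hypothetical stand-ins rather than anything defined in the lecture: `R` maps a state to its reward, and `T` maps a (state, action) pair to a list of (next state, probability) pairs; absorbing states can be modeled as self-loops.

```python
def value_iteration(states, actions, R, T, gamma=0.5, tol=1e-6):
    U = {s: 0.0 for s in states}  # start with arbitrary utilities
    while True:
        U_next = {}
        for s in states:
            # Expected utility of each action, weighted by transition
            # probabilities; take the best action's value.
            best = max(sum(p * U[s2] for s2, p in T[(s, a)])
                       for a in actions)
            U_next[s] = R[s] + gamma * best
        # Lather, rinse, repeat until no utility moves by more than tol.
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next
```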
Finding Policies Two
So, I'm going to update the utility of s by looking at the utilities of all the other states, including itself, the s primes, and weighting those based on my probability of getting there, given that I took an action. And what action am I going to take? Well, I'm going to take the one that maximizes my expected utility.
So it's sort of like figuring out what to do in a state, assuming that this U hat really is the right answer for the future.
Right. Now why is that going to work? I just made up the U hats, right? They started out as arbitrary utilities. The next little step, updating utilities based on neighbors, makes some sense, because effectively a neighbor is any state that you can reach, which is determined by the transition function. But all I'm doing is reaching states whose values are also made up, arbitrary utilities, so arbitrary values. Why should that help me? Well, it's because of this value right here. This is actual truth. The actual reward I get for entering into a state is a true value. So effectively, what's going to happen is I'm going to be propagating out the true value, the true immediate reward that I get for entering into a state, s, through all of the states that I'm going to see, propagating that information back and forth until ultimately I converge.
It's not obvious why we should converge, because we start off with an arbitrary function, and it seems like that could be really wrong. So we're, like, adding truth to wrong.
Yes, but then the next time around, I'll have added truth twice to wrong, and then more truth to wrong, and more truth to wrong. And eventually I'm going to be adding more and more and more truth, and it will overwhelm the original wrong.
And does it help that the wrong is being discounted?
Yes, it helps that the wrong is being discounted. Does it help that the wrong is being discounted? I actually don't know that it matters; I'll have to think about that for a moment. It certainly doesn't hurt. But I guess the way that I think about this, I mean, there's an actual proof, you can look it up, but in the end I tend to think of this as a kind of simple contraction proof: effectively, at every step you have to be moving towards the answer, because you've added in more of the reality, more of the truth. And if you remember, the utility of a state is just all of the rewards that you're going to see. So basically, at every time step you have added the reward that you're going to see, and then all of the rewards you're going to see after that. And so you've gotten a better estimate of the actual sequence of rewards you're likely to see from the state. And if that gets better for any of the states, then eventually that betterness will propagate out to all the other states that they can reach or that can reach them. That will keep happening, and you'll keep getting closer and closer to the true utility of the states until you eventually run out of closeness. Does that make sense at all?
Well, does it also help that the gamma is less than one?
Yeah, it does. The way I like to think of this is as a sort of contraction proof, if you've heard of those. The basic idea here is that you start out with some noise, but at every step you're getting some truth, and that truth gets added, and then the next iteration, more truth gets added, and more truth gets added. So, as you have some estimate of some particular state s, you get to update it based on truth, its actual reward, and you bring in more truth from the other utilities as well. As this particular utility gets better, that closeness to the true utility then gets spread to all the states that can reach it. And because the gamma is less than one, you basically get to overwhelm the past, which in this case is the original arbitrary utilities. And so as you keep iterating through this, the latest truth becomes more important than the past, lesser truth. And so you are always getting closer and closer to the truth, until eventually you get there. At least that's the intuition that I like to think of.
Yeah, I kind of get that as an intuition, though I'd probably be happier going through the math, but...
Well, we could do that, and by we, I mean the students can do that, by actually reading the proof.
Okay, so, cool. So this right here is an easy way to find the true value of states. And you do it by iterating, and it kind of has a name.
It kind of has a name.
Yeah, what do you think the name could be?
Bellman. Bellman's algorithm.
No, though that's probably reasonable.
Utility iteration?
Yes, except utility sounds better if you say value, so it's value iteration. And it works remarkably well. So, it doesn't give you the answer, but it gets you something that is closer and closer to the answer.
Right. And eventually it will converge; you'll get so close to converged that it doesn't matter. And once you have the true utilities, if you recall, we know how to define the optimal policy in terms of utilities. So if I give you the utilities, then the optimal policy is just: well, I'm in a state; look at all the states I might get to; figure out the expected utility I'm going to get for a given action; pick whichever one is maximum, and I'm done.
So solving for the utilities, the true values of the states, is effectively the same thing as solving for the optimal policy.
I think so.
Okay, Michael, so you seemed like you knew what you were doing
so I thought I would verify that by giving you a very easy quiz.
This doesn't look so easy.
It is easy. So here's the quiz. To help you, I've written up both Bellman's equation and the update that we would apply to the utilities. I neglected to write the little hats and everything because I just don't think you need that, but these are the two equations that you need to be able to think about. This is just what we've written up before. And I've written down the grid world we've been playing with the entire time. And I want you to figure out how value iteration would work, for the first iteration and the second iteration, for this particular state here that I've marked with an X, okay?
Alright, and there's a little more information you need before you go. Gamma, in this case, is going to be equal to one half, mainly because it's easy to do the math that way. The rewards, just to remind you, for all of the states except for the two goal or absorbing states, are going to be minus 0.04. And the arbitrary starting utilities at time 0 are going to be 0 for all of the states, except in the two absorbing states, where I already know the utilities are 1 and minus 1 respectively. Okay?
You got all that?
I think so.
[LAUGH] I feel confident that I'm going to slip up, but okay, yeah, I think I can take a stab at this.
Alright, so gamma's one half, rewards are minus 0.04, and the arbitrary starting utilities at time 0 are 0, except here at the absorbing states. Tell me how the utility here will evolve after one step, or one iteration, and two steps, or two iterations. Okay then, go.
Alright, Michael. What's the answer?
I think I've already made a mistake.
Okay. What's the mistake?
Suggesting that we do a quiz.
[LAUGH] I'm pretty sure that's always true. Unless, Michael, by doing a quiz now and suffering pain, you, in the future, are a better person. In which case, you have made the right long-term decision.
I see. This is an MDP meta-point you're making.
Alright. But let's just buckle down and do this thing. So...
Alright, so at that state X, we have to consider, according to the equation: we're going to do U sub one at X. And that's going to be the reward at X, which is minus .04.
Mm-hm.
Plus gamma, which is a half.
Feel free to write these things down.
Okay, so, let's see. It's going to be minus .04 plus one half times...
Alright. So now we need to do four different actions.
So I would make, like, a brace-y thing at this point. Not a bracket, but a brace. Alright?
Or I could do a bracket, because you're going to
notice immediately that it's obvious what the right action is.
Okay, alright. Well, we know that the right action's going to be to go to the right.
Yeah, but even then, you know, you don't have to do the rest of the computation, because my first guess at all the utilities is that they're 0. Which means you're always going to want to take the action that gets you to the plus one with the highest chance. So there's just no point in...
I see. Okay, fair enough. Thank you for the shortcut. So we only have to do the one action, which is to the right. And if we go to the right, there are three possible next states we could end up in. One is back at X, which has a value of zero. One is the thing underneath X, which has a value of zero. And then the last one, with probability 0.8, is to go to the plus one.
0.8 times plus one.
Which is 0.8.
Okay. And that is?
Okay. So 0.8 times a half is 0.4, minus 0.04 is 0.36.
Yep. And that is correct.
Okay, so to do the same thing for U2, we're going to need the U1 values for a bunch of other states, it seems to me.
Maybe. So let's write that down. We know for now the utility here is .36, right?
And you're saying that in order to do U2, I'm now going to have to... you know, I was able to avoid doing some of the math before because these were all zeroes, so it was just easy to do. But when I went right, I either stayed where I was, went to the plus one, or ended up going down. Well, I think by the same argument that allowed us to cheat our way out before, it's still going to be best to go to the right, so...
Yeah, I know that, but the value itself is going to depend on the values of several other states.
Yeah? Well, how many other states?
Oh, just that one.
Just this one. So, what was U one of this state? I see. So, presumably, we want to avoid falling into the pit, so the best thing we can do is bash our head against the wall.
Which will get us a -.04 for that state.
Okay, alright, so maybe this isn't so bad. So now, for our U2 value, we need -.04 plus a half times: 0.1 times 0.36, plus 0.1 times negative 0.04, so that's minus 0.004, plus 0.8 times one, which is 0.8 again, just like it was last time. And I get 0.376.
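Checking the arithmetic being done here:

$$U_1(x) = -0.04 + \tfrac{1}{2}\big[\,0.8(1) + 0.1(0) + 0.1(0)\,\big] = -0.04 + 0.4 = 0.36$$

$$U_2(x) = -0.04 + \tfrac{1}{2}\big[\,0.8(1) + 0.1(0.36) + 0.1(-0.04)\,\big] = -0.04 + \tfrac{1}{2}(0.832) = 0.376$$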
Which is what I get by also getting out your calculator. Okay, so 0.376, and you can imagine how we would do that on and on. I want to point something out, Michael. You decided to figure out what the true utility was for the state under X by bashing your head into the wall. But you know, based on the discussion we had earlier, that actually the optimal policy would involve going up instead of bashing your head into the wall. What you did at that point was in fact right, because everything else in the utilities was all zero: the best thing you could do was avoid ever falling into the minus one, so the policy at the very first step of bashing your head into the wall is in fact the right thing to do at that point. But what you'll notice is that the next time around, the utility for the X state is bigger than zero, and in fact will keep getting bigger and bigger, as you can see: it went from .36 to 0.376. Which means that at some point it's going to be worthwhile to try to go up instead of bashing your head into a wall.
I see. So this is kind of cool, that it works. But it does seem like a really roundabout way of getting there. I mean, is there some way that we could, I don't know, maybe take advantage of the fact that there aren't that many policies?
Yeah, so you actually said something fairly subtle there, so let's see if we can unpack it. Let me point out two things, which I think will get us to what you're asking. The first is: do you realize that the reason value iteration works is because eventually value propagates out to the neighbors, right? The first time, we can calculate the .36 without really worrying about anything around it, because the utilities are all zero, and in fact, based on the initial utilities, even for this state here we do the wrong thing. But after some time, this state's value becomes a truer representation of its utility, and in fact gets higher and higher, and the right thing to do here will be to go up. You'll also notice that after another time step I'm going to need to start looking at the value of this state as well. So eventually I'm going to have to figure out the utilities, or the values, of all of these states, and this plus one is going to propagate out towards the other states, where this minus one will propagate out less, because you're going to try to avoid falling in there. So that makes sense, right? But what's propagating out, Michael? What's propagating out is the true utilities, the true values of these states. But what's a policy? A policy is a function from what to what?
States to actions.
Right, a policy is the mapping from
state to action. It is not a mapping from
states to utilities.
No, that's what U is.
That's what U is.
Or that's what you are.
So, [LAUGH], if we have U, we can figure out pi, but U is actually much more information than we need to figure out pi. If we have a U that is not the correct utility but, say, has the ordering of the actions correct, then we're actually doing pretty well, right? It doesn't matter that we have the wrong utilities.
I see. Uh-huh.
We did this in the first third of the class as well, when we noticed that in the Bayesian learning case we were computing actual probabilities. But we don't really care about actual probabilities; we just care that the labels are right. There's a very similar argument here. We don't care about having the correct utilities, even though by having the correct utilities we'd have the right policy. All we actually care about is getting the right policy, and order matters there, rather than absolute value. Does that make sense?
Yeah, that's interesting. It's almost like pi is more like a classifier, right? It's mapping inputs to discrete classes, and the U is kind of more like regression, where it's mapping these states to continuous values.
Right, and given the utilities, we can find pi; given pi, there's an infinite number of utilities that are consistent with it. So what you end up wanting to do is get a utility that's good enough to get you to your pi. Which is one reason why you don't have to worry about getting absolute convergence in value iteration. But it gives us a hint of something we might do that's a little bit better, working with the policies, that might go faster in practice. So I'm just going to take three seconds to give you an example of that. Okay?
Finding Policies Three
Okay, so, Michael, let's see if we can do something that might be faster and just as good as what we've been doing with value iteration. And what we're going to do is emphasize the fact that we care about policies, not values. Now, it's true that given the true utilities we can find a policy, but maybe we don't have to find the true utilities in order to find the optimal policy. So here's a little sketch of an algorithm, okay? It's going to look a lot like value iteration. We are going to start with some initial policy, let's call it pi sub zero, and that's just going to be a guess, an arbitrary setting of the actions that you should take in different states. Then we're going to evaluate how good that policy is, and the way we're going to do it at time t is to calculate its utility, which I'm going to call U sub t, which is equal to the utility you get by following that policy, okay? And I'll show you in a moment exactly how you do that, alright? But I just want to make certain that you kind of buy that maybe we can do that. So: given a policy, we ought to be able to evaluate it by figuring out what the utility of that policy is. And again, we'll talk about that in a second. And then, after we know what the utility of that policy is, we're actually going to improve that policy, in a way similar to what we did with value iteration: we're going to update our policy at time t plus one to be the policy that takes the actions that maximize the expected utility, based on what we just calculated for U sub t. Now notice this will allow us to change pi over time, because imagine that we discover that in some state there is actually an action that is very good in that state, one that gets you some place really nice, gives you a really big reward, and from then on you do fairly well. Well, then all the other states that can reach that state might end up taking a different action than they did before, because now the best action would be to move towards that state. So these two steps can actually lead to some improvement of the policy over time. But the key thing here is we have to figure out exactly how to compute U sub t. Well, the good news is we know how to do that, and it's actually pretty easy. And it boils down to our favorite equation, Bellman's equation.
So our utility at time t, that is, the utility that we get by following the policy at time t, is just, well, the true reward that we're going to get by entering that state, plus gamma times the expected utility, which now looks like this. There, does that make sense? Do you see that?
Okay, hang on, it looks a little different from the other equation. So did you mean for U sub t to be defined in terms of U sub t, and not U sub t minus one?
Okay, that's interesting. And the max is gone, but instead of the max, there's a policy stuck into the transition function.
A choice of action is determined by the policy.
Right. And that's actually the only difference from what we were doing before: rather than having this max over actions, we already know what action we're going to take. It's determined by the policy we're currently following.
Okay, but isn't this just as hard as solving the thing with the max, you said?
Well, what was the problem that we were solving before with the max?
That was the Bellman equation.
Yes, but we were solving a bunch of equations. How many of them?
So we were solving n equations, and how many unknowns?
What's the difference between this, n equations and
n unknowns, and the other n equations and n unknowns?
Well, n is the same.
There's no max, though.
There's no max. And what was it that made solving that hard before?
The max made it nonlinear. But the max is gone now. You're saying this is a set of linear equations?
Yeah. Because, well, there's just a bunch of sums, and the pi is not like some weird function; it's just effectively a constant. So now I have n equations and n unknowns, but they're n linear equations. And now that I have n linear equations and n unknowns, I can actually compute this in a reasonable amount of time, by doing matrix inversion, and regression, and other magic hand-wavy things.
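As a minimal sketch of that linear solve, not from the lecture itself (the array names and shapes here are assumptions): once the policy is fixed, the n equations can be solved in one shot.

```python
import numpy as np

def evaluate_policy(T, R, policy, gamma):
    """Solve the linear system U = R + gamma * T_pi @ U for a fixed policy.

    T      : transition probabilities, shape (n_states, n_actions, n_states)
    R      : reward for entering each state, shape (n_states,)
    policy : action chosen in each state, shape (n_states,)
    """
    n = len(R)
    # T_pi[s, s'] = T(s, pi(s), s'): the transitions induced by following the policy.
    T_pi = T[np.arange(n), policy]
    # n linear equations in n unknowns: (I - gamma * T_pi) U = R.
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)
```

Solving directly like this is the roughly n-cubed step mentioned below.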
That's very slick.
It seems like it's still more expensive than doing the value iteration update, I guess.
Yeah, but you don't have to, perhaps, do as many iterations as you were doing before. So once you've evaluated it, which we now know how to do, and you've improved it, you just keep doing that until your policy doesn't change.
Mm-hm. And this does look a lot like value iteration to you, doesn't it?
Yeah, though it seems like it's making bigger jumps somehow.
It is, and that's because instead of making jumps in value space, it's making jumps in policy space, which is why we call this class of algorithms Policy Iteration.
Now, this inversion can still be fairly painful. If we don't worry about being highly efficient, it's roughly n cubed, and if there are a lot of states, this can be kind of painful. But it turns out there are little tricks you can do, like doing a little bit of value iteration here for a while to get an estimate and then, you know, kind of cycling through. So there are all kinds of clever things you might want to do, but at a high level, without worrying about the details of constants, this general process of moving through policy space, and taking advantage of the fact that by picking a specific policy you're able to turn your nonlinear equations into linear equations, turns out to often be very helpful.
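Putting the two steps together, a sketch of the full loop might look like this (again an assumption rather than the lecture's own code, reusing the hypothetical evaluate_policy above):

```python
import numpy as np  # assumes evaluate_policy from the earlier sketch is in scope

def policy_iteration(T, R, gamma):
    """Alternate policy evaluation and greedy improvement until pi stops changing."""
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)  # pi_0: an arbitrary initial guess
    while True:
        U = evaluate_policy(T, R, policy, gamma)  # evaluate pi_t
        # Improve: in each state, take the action with the highest expected utility.
        improved = np.argmax(T @ U, axis=1)
        if np.array_equal(improved, policy):      # converged: the policy is stable
            return policy, U
        policy = improved
```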
So, is it guaranteed to converge?
Well, that one's easy. I'm not going to go into it, but, you know, there's a finite number of policies, and you're always getting better, so eventually you have to converge. At least intuitively, it's very similar to the argument you might make for value iteration.
Cool.
Okay, Michael, so this is it. Why don't you help me remember what we learned in this one marathon session that was not at all distributed over multiple months?
[LAUGH] Sure. Or years.
So, MDPs. We talked about Markov decision processes, and we said what that meant. [LAUGH]
Okay, so that's the first thing we did: we talked about MDPs.
And an MDP consists of states and rewards and actions and, like that, transitions and discounts.
So you said discounts, and I just want to point out that there are some people who think that's part of the definition of the problem, and there are some people who think that's more of a parameter to your algorithm.
Okay, alright. So there are some people who think of it the right way.
Like me. And some people think of it the wrong way. So I tend to think of the discount as being something that you're allowed to fiddle with, as opposed to it being a fundamental part of the problem.
Then why not fiddle with the rewards?
Sure. You are allowed to fiddle with the rewards.
Oh! You are very open-minded!
I am open-minded; I believe that rewards should be able to marry other rewards. In any case, the important thing here is that there is some underlying process that you care about that is supposed to represent the world, and states, rewards, actions, and transitions capture that fairly well. The discount is, in some sense, an important part of the problem, because it tells you how much you want to care about the future versus the present. But it's reasonable to think of that as something that you might want to change outside of the kind of underlying physics of the world.
I see. Okay, that's fair. Yeah, in some sense, the states and the actions and the transitions represent the physical world, and the rewards and the discount represent the kind of task description.
Right. And of course, you say that, but you could decide to define states differently, and doing that would impact both your actions and the transition function, and so on and so forth. But the basic idea is right, which is that there's some underlying process we're trying to capture, and I think it's exactly right to say that states, actions, and transitions sort of capture that, and rewards and discounts capture more about the nature of the task you're trying to do in that underlying world.
Okay. And in that context, we talked about two really important concepts: policies and value functions?
Which we sometimes call utilities.
And how do utilities differ from rewards? The utilities factor in the long-term aspects, and the rewards just tell you about the moment to moment.
Utilities are like a group of rewards. Like a gaggle.
Or a murder.
Of crows. So then we talked about how we can assign value to an infinite sequence of rewards.
Mm-hm.
But it helps if we use discounting to do that, so that we don't get infinitely large sums.
And that allowed us to deal with infinite sequences but treat them as if their value is, in fact, finite, thus solving the immortality problem.
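For the record, the discounting argument being recapped here bounds the infinite sum with a geometric series:

$$\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \qquad \text{for } 0 \le \gamma < 1$$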
[LAUGH] And, let's see, then we kind of...
And the stationarity was a really important part of that.
Yep. In fact kind of drove everything.
Yeah. And all these things were tied up together in the Bellman equation itself.
Yes, which is an awesome equation and deserves capital letters. [LAUGH]
Is that it? And then, well, then we solved the Bellman equation using value iteration and policy iteration.
I think that was it. Alright,
so are any of these polynomial time algorithms?
Well, the way we've been talking about them, no, but you can map these into linear programs, which turns them into problems solvable in polynomial time. So yes, these problems can be solved that way.
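As an aside, the usual linear-programming formulation (not shown in the lecture, so take the exact form here as an assumption) is something like:

$$\min_{U} \sum_{s} U(s) \quad \text{subject to} \quad U(s) \ge R(s) + \gamma \sum_{s'} T(s, a, s')\, U(s') \;\; \text{for all } s, a$$

Linear programs are solvable in polynomial time, which is where the claim comes from.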
But actually, that reminds me of something that we haven't said and that we haven't learned today, that I think is worth mentioning.
Which is this: we've been talking about how this section of the course is about reinforcement learning. But we actually haven't done any reinforcement learning here, because we know everything: we know the states, we know the rewards, we know the actions, we know the transitions. We have some discount. We have been solving MDPs, but that's not quite the same thing as doing reinforcement learning. However, it's very important to do these things, to make it easier to think about how reinforcement learning works. So, in reinforcement learning, the difference is that you don't necessarily know the rewards and the transitions, or even the states and the actions, for that matter.
I see. So when are we going to get into reinforcement learning, then?
Next time. And when I say 'we' next time, I mean 'you' next time. That's your assignment: learn all about reinforcement learning and talk about it next time.