- Hi Michael.
- How are you today?
- I'm doing just fine. How are you doing?
- Alright. I'm a little out of practice with
this lecturing thing. So, I hope this goes well.
- I'm absolutely sure it will.
- So, today's lesson continues what you were talking about in terms of Game Theory, but I'm going to be focusing in on what happens when you're making decisions with more than one player in a sequence. Which we started to get into at the end of your discussion, but I'm going to go more into it.
- Okay. I really like the logo by the way.
- Thanks very much. Yeah, it's a special Game Theory logo.
- I like it very much, I like it very much. I
will point out, however, that all sequels should be called the quickening.
- Yeah, I was going to go with the quickening. Or
judgement day, but I didn't, didn't think that made any sense.
- Hm, that's a fair point.
Iterated Prisoners Dilemma
So let me take you back into what we
were talking about last time. We were talking about
the iterated prisoners dilemma, and here's the prisoners dilemma
payoffs that you wrote down for us. You remember this?
- I vaguely remember this.
- And remember it was about these two criminals, Smooth and Curly. And they
were, deciding whether to cooperate or defect
against each other when, after they've been arrested.
- Right, and they defect, they always defect.
- Right. So in particular, we said, well, what happens if they have multiple rounds in which to interact? So here they are, the two of them, and if they've got one round to live, we did an analysis and we indicated that there's really nothing they can do other than defect against each other.
- It's irrational to do anything else.
- So we said, all right, well, what happens if we allow
there to be more than one round? So now we've got two rounds.
And what we realized was that if you've got two rounds to go, then these players essentially face a one-round game, because what they're going to do in the last round has already been determined. It's almost as if that round doesn't really matter. There's nothing we can do now that's going to change what they're going to do in that last round, so it's sort of like there's just one round and we're going to defect again.
- Right, so life is terrible and everyone is out to get everyone else.
- Exactly, and
not only is it for two rounds, but this
same argument continues as you go three rounds or more.
- Oh, it's like a proof by induction.
- It's kind of like a proof by induction, yeah, well it's proof by ellipsis.
- Mm, that's my favorite kind of induction. Proof by ellipsis.
- [LAUGH] So the question then becomes, what happens if the number of rounds left is unknown,
right? So what we've realized is that if
you know how many rounds are left, the whole
thing comes unraveled and they're just going to defect forever. But we raised
the issue of, what happens if the number of rounds left is unknown?
- And it seems like it shouldn't really make any difference, because if it's some finite number and we just don't know what it is, then it seems like it should still reduce to this same setup that we have. So I was looking into this, and it turns out that's not the case. It actually does make a difference, and it's really interesting how it goes and how it connects back with other things we've talked about.
- Woo, tell me more.
- So here's how I started to think about it. Let's say, how can we represent the idea that we have an uncertain ending? Well, one way would be some kind of generic probability distribution over the number of rounds that the game's going to be played. But here's the simplest idea that I could think of. We have our two criminals, and what they're going to do is play a round of prisoner's dilemma. But at the end of that round, they're going to flip a coin. With probability one minus gamma, that was the last round and it's all over. But with probability gamma, they're actually going to play again. And they do this after each round, so each round is basically statistically independent of the other rounds.
- I see.
- So the setup here is: with probability gamma, the game continues. Now, notice that I chose gamma here. It's representing a probability, but in the past we've used it to represent a discount factor.
- But that actually is the same thing, right? In the normal discounting, we say that the value that you get one step from now is discounted downward by gamma. And that's exactly what you'd get if you worked out the expected value of a game where you continued making steps with probability gamma, and with probability one minus gamma it ends and you get zero from then until the rest of time. So every round here could be your last, or not, right? It could be that you actually get to continue playing. Does that make some sense?
- That makes perfect sense.
- Awesome. All right, so yeah, this is exactly that kind of situation. Well, here, let me ask you a question.
What's the expected number of rounds of this game?
- Well, I'll bet it's finite if gamma's less than one.
- Yes, I even wrote that down.
- Yeah, I'm smart. Or at least I can read.
- Sure, but specifically, we could actually write it as a function of gamma.
- Let's see. If gamma were something like 99%, then I would expect it to be about 100, right?
- I think that's right.
- So is that, is that your answer? [LAUGH] My function
of gamma is if gamma is .99 the answer is 100.
- Yeah, something like that. It's not a total function but it's a function.
- Well, it's a sample. I mean, you
do machine learning. Why don't you tell me what
the function would be given that sample.
- Well, we can make it a quiz or I could just tell you.
- Why don't you just tell me.
- Alright. So one over one minus gamma
is the answer. It works for your example.
- 1 minus .99 is one one-hundredth, and we're talking 1 over that, so you get 100. And, yeah, we could go through the argument as to why that's what it is. But this one over one minus gamma is what shows up all the time. If gamma is zero, then we're talking about one over one: the game lasts one round. That's exactly what we'd expect.
- Mm-hm.
- As gamma gets closer and closer to one, this quantity's getting closer and closer to infinity. In fact, it becomes unbounded as gamma hits one. So this is the expected number of rounds, and as you said, if gamma is 0.99, it's 100 rounds. And we already reasoned that at 100 rounds the whole thing falls apart.
- Right. And I notice the one over one minus gamma, of course, is just like the way we did discount factors when we started doing MDPs in the first place.
- Exactly, yeah, that, that kind of links them together.
- Hm, that's actually kind of neat.
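(As a quick aside, the expected-length claim is easy to check numerically. The following is a small Python sketch, not from the lecture itself, and the function names are mine. It compares the closed form 1/(1 - gamma) with a Monte Carlo simulation of the continue-with-probability-gamma coin flips.)

```python
import random

def expected_rounds(gamma):
    # Closed form: the round count is geometric, so E[N] = 1 / (1 - gamma).
    return 1.0 / (1.0 - gamma)

def simulate_rounds(gamma, trials=100_000, seed=0):
    # Play a round, then keep playing with probability gamma; repeat.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 1                       # the first round always happens
        while rng.random() < gamma:
            n += 1
        total += n
    return total / trials

print(expected_rounds(0.99))           # ~100, matching the example
print(round(simulate_rounds(0.5), 2))  # close to the closed form's 2.0
```

For gamma = 0.99 both approaches land near 100 rounds, matching the number in the dialogue.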
Tit for Tat
- So if we're going to be talking about strategies in this game that has an uncertain ending, we can't just write down sequences of actions anymore. We can't just say cooperate, defect, defect, defect, defect, or even some kind of tree of possibilities, because those are going to be finite representations. We need some other representation that allows us to play for an unbounded number of rounds.
- And I'm going to start off by presenting an example of such a strategy, one that's very famous for the iterated prisoner's dilemma, and it's called tit for tat. The structure of tit for tat goes like this: on the first round of the game, a player playing this strategy will cooperate, and then in all future rounds, the player is going to copy the opponent's previous move. Does that make sense?
- It does. So basically, I start out acting as if you're going to cooperate with me. And the moment you don't cooperate with me, I will start to defect, and we'll be in the old-style prisoners' dilemma. Right?
- Well, no, not quite. What this says is that it actually copies the opponent's previous move. So if an opponent goes cooperate, defect, cooperate, defect, cooperate, defect, defect, defect, defect, cooperate, cooperate, cooperate, you're going to see something very similar coming out of the tit for tat agent.
- I see, I see, I see.
- In fact, we can represent the strategy as
a little finite state machine, like this.
- Yeah, I like that, okay.
- And so you can see exactly how it proceeds. It starts off cooperating, and then in each round it waits to see what the opponent does. That's the green letters here. And then it follows the corresponding arrow to determine whether it is going to cooperate or defect in the current round.
- Sure, that makes sense.
- So in this picture, the black letters represent my move, and the green letters represent my observation of the opponent's move. Or at least they do if I'm playing tit for tat.
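(To make the strategy concrete, here is a minimal Python sketch of tit for tat, written by the editor rather than taken from the lecture: cooperate on the first round, then echo the opponent's previous move.)

```python
def tit_for_tat(opponent_history):
    # Cooperate on round one; afterwards copy the opponent's last move.
    if not opponent_history:
        return 'C'
    return opponent_history[-1]

# Against an opponent who plays D, C, D, C, D, tit for tat plays C
# first and then echoes each of the opponent's moves one step late.
opponent = ['D', 'C', 'D', 'C', 'D']
mine = [tit_for_tat(opponent[:t]) for t in range(len(opponent) + 1)]
print(''.join(mine))  # CDCDCD
```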
Tit for Tat Quiz
- What happens if we follow Tit for Tat?
- That's a great question. So let's make it a little more concrete. Here's a set of strategies that an opponent might adopt, and the question is, what does it look like Tit for Tat is doing in response to each of these? I was hoping that you'd check the corresponding boxes. So basically, for each row, say, okay, if Tit for Tat is playing against always cooperate, what does Tit for Tat do? Does it always defect, does it always cooperate, does it cooperate and then defect, defect, defect, defect, or does it alternate between cooperate and defect? This is just to give you some practice in interpreting the behavior of Tit for Tat.
- Okay, that works for me. I think I could do this.
Tit for Tat Quiz
- Alright, so let's start off, maybe you can tell
me what happens when Tit for Tat plays against always cooperate.
- So what happens if you always cooperate? Well, let's see, I start out cooperating. And since the other person is cooperating, I will continue to cooperate, because that's what they did the last time. So I will always cooperate.
- That's right. Good. Alright, so what
about if we play against always defect.
- Well, if you always defect, the first time I'm going to cooperate, because that's what you said Tit for Tat is. But from that point on, I will do what my opponent does, which is defect. So I will cooperate and then defect, defect, defect, defect, defect, defect, ellipsis.
- Yeah. So I put this always defect in there, but actually it can't ever be the right answer, right? [LAUGH] Because Tit for Tat always starts off cooperating. So the other three seem like they might be possible; always defect is not possible. Good. Alright. So what if Tit for Tat plays against another Tit for Tat?
- Well, I started out cooperating and my opponent cooperated. And since I'm going to do what that person does, I will cooperate; but since that person's doing what I did, they will also cooperate. And so we will both cooperate forever. So we will always cooperate.
- Nice. So isn't that kind of interesting? Tit for Tat, when it's playing against itself, is a very cooperative fellow.
- Hm.
- But if Tit for Tat is playing against
something that defects, it becomes a little bit more vengeful.
- Alright, so what if it plays against something that is a little unsure of itself? So it starts off defecting, then cooperating, then defecting; it's sort of an alternating kind of thing, right? This is one that starts off with a defect.
- Right. So the first thing I do is cooperate, and then after that, effectively I do what the opponent did one step before. So I basically take the D-C-D-C-D, I'm pointing to the screen, you can't see me, and I just put a C in front of it, because that's what I'm going to do. So I will do C-D-C-D-C-D-C-D, ellipsis. So that's your last choice.
- Isn't the C-D-C-D-C-D, ellipsis in Atlanta?
- It is, actually. It is the home of
all such analysis of diseases in the United States.
- The Center
for Disease Control?
- Is that what it's called?
- Yes, that's exactly what it is. Come to Atlanta and work for us.
- [LAUGH] Alright, so good. So, that, that's the pattern of,
of responses that Tit for Tat makes against this set of strategies.
- Oh, so we answered my question.
- Alright, so now that we have a sense of what tit for tat does against various strategies, let's try to think about what we should do against tit for tat. So what do we do if we're facing tit for tat? I'm just going to break it down to two possibilities. It turns out there are actually more, but these two are pretty instructive. So let's pretend that we have to choose between always defect as a way of playing against tit for tat, or always cooperate playing against tit for tat. What I've written down here is what the total discounted reward is going to be, or the total reward in this case, as a function of gamma. So let's start with what happens if you play always cooperate against tit for tat. Well, you already told me that such a thing will result in tit for tat always cooperating.
- Mm Hm.
- And that means we're going to play in this box, the cooperate-cooperate box. And that means on every single round we're going to get a minus one, which means over an infinite run we're going to get a total of minus one over one minus gamma.
- Mm Hm. That makes sense.
- Okay? You agree with that?
- I do. Just minus one repeated over and over again.
- Now, always defect. As you recall, for the always defect agent against tit for tat, the first thing that's going to happen is it's going to defect while the tit for tat cooperates, right? So we're going to get zero for playing that strategy on the first round. Zero doesn't sound very good, but look at the alternatives. They're all negative, so zero's pretty good.
- So it does this sort of, you know, good thing in the first step, but after that, tit for tat responds by always defecting. Right? And that means we're going to be stuck in this box, the defect-defect box, where you get minus sixes for the rest of ever.
- So that means minus six over one minus gamma. But that
starts one step from now so we multiply it by another gamma.
- Alright. So these are two different expressions that represent what our payoff would be for adopting the two different strategies. And in fact, if gamma is really high, very close to one, then this one is a really good answer, right? Because it's minus one over one minus gamma, so for high gamma we're talking about something that's minus one times a really big number.
- Whereas the other one is not so good for high gamma, because what's going to happen is it's going to end up getting minus six on every step, so it's going to do worse overall. But if we're talking about low gamma, say zero for example, then with a gamma of zero, always defect will get zero plus zero, but always cooperate will get negative one over one. And zero is better than minus one. So for really small gamma, like if the game's unlikely to last many rounds, you should defect. But if the game is going to last a long time, then you should cooperate.
- I believe that.
- Cool! Alright, so then my question to you is: what's the value of gamma for which these two different strategies to play against tit for tat are equally good?
- I think I know the answer.
- Woah! That was fast. Let's give everybody else
a chance to think about it.
- Alright, Charles, you said you had the answer. How do we figure it out?
- It's one sixth.
- I don't know if the quiz will accept that, whatever one sixth is expressed as a decimal, but yes, one sixth. How did you get that?
- I saw the number six and figured it had to be one sixth, because you said it was low. But here's what I actually did, or how I was thinking about it. You said, well, when are they equally good? If you always defect, you get minus six gamma over one minus gamma. And if you always cooperate, you get minus one over one minus gamma. They're equally good when those two values are the same.
- Good, alright, and the denominators are the
same, as long as gamma's not one, that's fine.
- So minus six gamma equals minus one, and if we divide by the negative six, we get gamma equals one sixth.
- Excellent. So that's interesting, right? It's saying that for gamma values that are less than one sixth, we should just defect, because the game's not going to last long enough for us to form any kind of coalition. But for things higher than one sixth: a half, three quarters.
- Or a sixth plus epsilon.
- Yeah. I like that. That's actually very cool.
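(The arithmetic behind the crossover is simple enough to script. This is an editor's Python sketch with made-up function names: always-cooperate against tit for tat earns -1 every round, for -1/(1 - gamma) in total, while always-defect earns 0 once and then -6 forever starting one discounted step later, for -6*gamma/(1 - gamma).)

```python
def always_cooperate_value(gamma):
    # -1 on every round of the discounted repeated game.
    return -1.0 / (1.0 - gamma)

def always_defect_value(gamma):
    # 0 on the first round, then -6 forever, starting one step later.
    return -6.0 * gamma / (1.0 - gamma)

g = 1.0 / 6.0  # the crossover: -1/(1-g) equals -6g/(1-g) exactly here
print(always_cooperate_value(g), always_defect_value(g))         # both about -1.2
print(always_defect_value(0.05) > always_cooperate_value(0.05))  # True: defect for small gamma
print(always_cooperate_value(0.5) > always_defect_value(0.5))    # True: cooperate for large gamma
```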
Finite State Strategy
- Now, we kind of cheated here, because I told you there were just those two strategies. But there are actually a bunch of other strategies you can play against tit for tat. And it's worth thinking through: how do you compute a best response to some finite-state strategy? Tit for tat is a finite-state strategy in that it has these two [LAUGH], these two states, and the strategy is expressed in terms of transitions between those two states. But in general, if we have some kind of finite-state strategy, like tit for tat, how do we figure out how to maximize our own reward in the face of playing against that strategy? So in this picture here that I drew, the states are labeled with the opponent's choice, the finite-state strategy's choice, okay?
- The edges, those are in black and labeled in green here, are labeled with our choice. So, for example, if our opponent is in this state.
- We have a choice. We can either cooperate or defect on this round.
- So the green arrows tell us how that will impact the state of the opponent. And then these red numbers I just added are the information that, if the opponent is about to cooperate and I choose to cooperate, I can just look up in the payoff matrix that that's a -1 for me. Right? Agreed?
- So I just annotated all these edges, all these choices, with these extra numbers. One of the things that's cool about this is that, unlike the plain payoff-matrix representation we had before, our choice impacts the payoff, which is the same as before, but it also impacts the future decisions of the opponent. That gives us this structure here, and it also says that maybe this is a slightly harder thing to figure out, because we can't just maximize the immediate number. We actually have to think about where that's going to lead us in the future as well.
- So two things then. One,
I was always fond of saying that the matrix was all that you
needed. But that really only made sense when you were just playing once.
- Yes. That's right.
- Right? And, two, I look at this and it's a finite state machine
but you know what else it looks like to me? It looks like an MDP.
- Excellent. It is indeed an MDP. Now, in this case, my opponent's finite-state strategy is deterministic, so it's a deterministic MDP, but it is a discounted MDP. Gamma's playing the role of the discount factor. The entries from the payoff matrix are playing the role of rewards. Our action is playing the role of our action choice, and the opponent's internal state structure is playing the role of our states. So it is an MDP, and so how do we figure out what an optimal strategy is against a finite-state strategy?
- We solve the MDP.
- Yeah, exactly. So any method for solving an MDP can then be used to actually compute the strategy. So what is the strategy going to look like? It is going to be a mapping from states of the opponent to action choices for us.
- Right, but that's fine, because a state doesn't have to be your state; it's just what matters. And what matters in this case is what the opponent is going to do.
- Right. So now, what are the strategies that can be meaningful against tit for tat? If we cooperate, then we're going to stay in this state, and it's always going to be the right thing to do to cooperate. So always cooperate is one. If we defect, now we have a choice again: we could defect from this state, which would cause us to defect forever, so always defect is another one. But what's the other thing that could happen?
- Well, we could tit for tat ourselves.
- Well, sort of. I mean, we could defect against cooperate, but cooperate against defect, which would actually cause us to do D-C, D-C, D-C. Oh, I see. I'm not sure how to say it, but the policy is: defect when you're in this state, and cooperate when you're in this state. And the effect of that is to go back and forth against tit for tat.
- Basically, take this loop here. And those are the only policies that matter. In this case, we worked out that always cooperate is good against tit for tat if we have a high discount factor, and always defect is better if we have a low discount factor. But we can get that for real by solving the MDP.
- Right, and that makes sense. The reason those are the only three that make sense is that, if you think of this as an MDP, then it has no history. So when you are in C, there are really only two choices: if you look at the way you've drawn it, you either stay where you are, in C or in D, or you take the loop. And those are really the only three options, because the rest of them would require you to remember what you did a time step or two ago, and there's no way to do that in an MDP, at least not as you've written it.
- Well, there's a way to do it; it just would never be better than doing it this way. So an MDP always has a Markov, deterministic optimal policy.
- So we only need to consider those.
- I think that was the same thing that
I was trying to say, but with different words.
- [LAUGH] Ok.
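(To see what "solve the MDP" can look like in code, here is an editor's sketch of value iteration on the two-state MDP that tit for tat induces, using the payoff numbers from the lesson's matrix. The state is the move tit for tat is about to play; our action collects the matrix payoff and, because tit for tat copies us, determines the opponent's next state.)

```python
# My payoff for (my action, opponent's action), from the lesson's matrix.
PAYOFF = {('C', 'C'): -1, ('C', 'D'): -9, ('D', 'C'): 0, ('D', 'D'): -6}

def best_response_to_tft(gamma, iters=2000):
    # Value iteration: V[s] = max_a PAYOFF[(a, s)] + gamma * V[a],
    # since playing action a sends tit for tat into state a next round.
    V = {'C': 0.0, 'D': 0.0}
    for _ in range(iters):
        V = {s: max(PAYOFF[(a, s)] + gamma * V[a] for a in 'CD') for s in 'CD'}
    # Greedy policy with respect to the converged values.
    return {s: max('CD', key=lambda a: PAYOFF[(a, s)] + gamma * V[a]) for s in 'CD'}

print(best_response_to_tft(0.5))   # cooperate in both states: gamma > 1/6
print(best_response_to_tft(0.05))  # defect in both states: gamma < 1/6
```

This reproduces the earlier analysis: above the one-sixth crossover the best response is to cooperate everywhere, below it to defect everywhere.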
Best Responses in IPD
- Alright, so now that we have a handle on what it means to compute a best response against some finite-state strategy, let's actually take a look at these. So this is a quiz. Imagine that we have a gamma that is large, so greater than a sixth.
- What is the best response to each of these strategies? So this will be a quiz, but let me just make clear what the goal of the quiz is. If you are playing against always cooperate, which of these is the best response to it; which of these is going to actually have maximum reward? If we're playing against always defect, which is going to have maximum reward? And if we're playing against tit for tat, which is going to have maximum reward? Any questions about that? Does that make sense?
- I'm just making certain I have the rows and the columns right. So my opponent is playing on the rows, and I'm choosing among the columns for each one of those rows. Yes?
- Yeah. So I labelled the columns "best possible response," and these are the opponent's strategies on the rows.
- Okay, go.
Best Responses in IPD
- Alright. So is this, is it clear what these answers are?
- Let's find out. So, let's see, we already worked out the math on this. So we know that, for gamma greater than a sixth, going against someone who's always going to cooperate, then I should always cooperate.
- Yes. So, we didn't actually work that out. What we worked out was what to do if you were playing against tit for tat. Here you're playing against someone who is always cooperating, and is completely oblivious to us.
- No, no, no, no, no, you're right. You should
always defect because you're always going to win. Yes, yes, yes.
- You're always going to win. You're going to get zero on every time step.
- Right, right.
- No, that makes sense. That's beautiful.
- Alright, what about always defect?
- Well, if you're always going to defect, you might as well defect.
- Because we're just in the regular old prisoner's dilemma world. [LAUGH] Alright. So now we have this other strange beast here. Our opponent is playing tit for tat. So we could always defect.
- But we would get a higher score if we can convince tit for tat to cooperate with us.
- For a gamma greater than a sixth.
- That's right.
- So that you should always cooperate.
- That is true, however.
- What if we played tit for tat against tit for tat?
- You'll end up in the same place, so that's just as good.
- And that's kind of interesting, if you think about mutual best responses.
- So that's a pair of strategies where each is a best response to the other. We have another name for that. Do you remember?
- [LAUGH] You taught, you told us what it was.
- Yeah, but I probably used different words.
- It's, it's a Nash equilibrium.
- That's what a Nash equilibrium is: a pair of strategies where each is a best response to the other. There's no way that either would prefer to switch to something else to get higher reward.
- And that makes perfect sense. You're in
an equilibrium. And that is a Nash equilibrium. Okay.
- So we can use this little table here to actually identify Nash equilibria. So what would an equilibrium pair be? If one player plays always cooperate, then the best response to that is always defect.
- But the best response to always defect is always defect. So always cooperate is not part of a Nash equilibrium.
- But what about always defect versus always defect?
- No, it is.
- So that's a Nash. Right?
- Because they're both doing the thing
that is the best response to the other.
- Alright, this box here, always cooperate against tit for tat. That's not okay, because if a player does always cooperate, it's always better to switch to always defect.
- But check this out: if you are playing tit for tat, and the other player's playing tit for tat, there's no reason to switch, because it's actually a best response; it's the optimal thing to do. And that works from both players' perspectives.
- And that makes sense. So, like you said, check this out, and you've been using check marks; that's very good. [LAUGH]
- So we're in this situation where we have two Nash equilibria.
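(Those two equilibria can be verified by brute force. In this editor's sketch, each pair of strategies plays a long discounted game at gamma = 1/2, and a pair counts as Nash if neither player can do strictly better by unilaterally switching to another strategy in the set. Strategy names here are mine.)

```python
# (Player 1's payoff, Player 2's payoff) for each joint action.
PAYOFF = {('C', 'C'): (-1, -1), ('C', 'D'): (-9, 0),
          ('D', 'C'): (0, -9), ('D', 'D'): (-6, -6)}

def move(strategy, opp_history):
    if strategy == 'always_C':
        return 'C'
    if strategy == 'always_D':
        return 'D'
    # tit for tat: cooperate first, then copy the opponent.
    return opp_history[-1] if opp_history else 'C'

def values(s1, s2, gamma=0.5, horizon=200):
    # Discounted payoffs (to player 1, player 2) over a long horizon.
    h1, h2, v1, v2, disc = [], [], 0.0, 0.0, 1.0
    for _ in range(horizon):
        a1, a2 = move(s1, h2), move(s2, h1)
        r1, r2 = PAYOFF[(a1, a2)]
        v1 += disc * r1
        v2 += disc * r2
        h1.append(a1)
        h2.append(a2)
        disc *= gamma
    return v1, v2

STRATS = ['always_C', 'always_D', 'tft']

def is_nash(s1, s2):
    # Neither player can do strictly better by deviating unilaterally.
    v1, v2 = values(s1, s2)
    return (all(values(d, s2)[0] <= v1 for d in STRATS) and
            all(values(s1, d)[1] <= v2 for d in STRATS))

print([(a, b) for a in STRATS for b in STRATS if is_nash(a, b)])
# [('always_D', 'always_D'), ('tft', 'tft')]
```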
- Indeed, and one of these Nash equilibria [NOISE] is cooperative. Which is the thing that we were sad about, or at least that I was feeling really sad about, in the last lesson: the idea that, man, it's clear that they should just try to get along. You explained that you can modify the reward structure, and then they would get along better. But here it turns out, well, no, another thing you can do is just open it up to the possibility of playing multiple rounds. As long as you don't know how many rounds, it becomes possible to have a strategy that is best off cooperating, and that is in fact a Nash equilibrium.
- Isn't that equivalent to changing the reward structure?
- It's definitely related to changing the reward
structure. It's a particular way of changing it.
- A very particular way.
- But it's, yeah, because it's not true anymore that we can do this in the one-shot case. You have to be in the repeated-game setting.
- Right, so you change your reward structure to be a sum of rewards. But it's actually an expected sum of rewards, and you don't know where it is you're going to stop. So I guess you've changed the game, you've changed the rewards, but in a sort of very subtle way.
- Yeah, the whole game is different, really.
- Yeah man, you changed the game. [LAUGH].
- Sometimes you gotta change the game. Don't hate the game.
- No, you are supposed to hate the game. Don't hate the player, hate the game.
- I'm really proud of us for figuring out a way to make the prisoner's dilemma a little more friendly. It turns out, though, that this sort of core idea is very general and kind of cool. So it leads us to a topic that we could call repeated games and the folk theorem.
- So the general idea here is that when you're in the repeated-game setting, this possibility of retaliation, the possibility of not being cooperative, actually opens the door for cooperation. Because now it can become better to just get along and cooperate than to get retaliated against.
- Right. And that makes sense as long as the retaliation is plausible.
- Well, we're not talking about plausibility of the retaliation. We're just talking about, right, because tit for tat isn't analyzed in that way. It doesn't say whether or not the retaliation is actually plausible. It just says, well, if you have this strategy and you play against this strategy, there's no incentive to switch.
- But we'll, yeah, we'll get into this plausibility thing in a little bit.
- Oh, I, I stumbled across a word. Okay, go.
- [LAUGH] But I want to point out one thing first, which actually kind of irritates me, kind of along the same lines as how regression is the wrong word and reinforcement is the wrong word. Folk theorem is the wrong word. So, what is a folk theorem?
- It's two words.
- In general mathematics... sorry, say again?
- A folk theorem is two words.
- That's fair, yes. So I think it's actually two different terms of art. In mathematics, folk theorems are results that are known, at least to experts in the field, and they're considered to have established status, but they're not really published in their complete form. There isn't some kind of original publication saying, oh look, I found this thing. It's more something that kind of everybody knows, and so you don't really give anybody credit for it. So it's like a theorem that is in the general cloud of understanding. It's sort of among the population, among the group.
- Wait! Does anybody ever prove these folk theorems?
- You can, yeah! All folk theorems are provable. But it's not like you say, you know, this is Charles Isbell's theorem.
- It's like, no, it's just a folk theorem. Charles can prove it; we can all prove it. But we're not really sure who proved it first. It was just one of those things that everybody knows.
- Oh, so it's like an oral tradition.
- Yeah, yeah. I think that's a good way to think about it.
- Okay. I'm just imagining mathematicians sitting around
a fire in the winter, cuddled up against
one another sharing theorems that have been proved
since the beginning of time to one another.
- Exactly. Right, that's the image that I have as well. That's just folk theorems. It's this sort of, you know, I learned it from my grandmother and now I'm telling you.
- Hm. I like that. I like to think of mathematicians that way.
- [LAUGH] As grandmothers.
- So that's what a folk theorem is in mathematics. However, in game theory it means something different. It is referring to a folk theorem, but it's also referring to a particular folk theorem. The folk theorem in game theory refers to a particular result that describes the set of payoffs that can result from Nash strategies in repeated games. So it's a funny way that they use the word: it's a folk theorem, but it's also the folk theorem.
- Hm, well, maybe it was actually proven the first time by some guy named Folk.
- That's a good idea. We should push that forward, like call him Folke or something like that. Folke's theorem.
- But but no. [LAUGH]
- Oh well, I tried.
- Yeah that's that's really great idea
that we're going to just move on from.
- But what I'd like to do next is kind of build up to this notion of the folk theorem. It's going to require a couple of basic concepts that are not too different from stuff we've already talked about, but we're going to make them concrete and then show how they all come together to provide us with this idea of a folk theorem.
Repeated Games and the Folk Theorem
- The thing that I find most useful in
trying to understand the folk theorem and, and what
it says and how it works is a thing that I call a two-player plot. I don't know
if other people have other names for it, but
this, this concept is out there and often discussed.
I just don't know what it's called [LAUGH]. But
here's the idea of it, it's a really simple idea.
- It's like a folk plot.
- It is a, it is a folk, a folk plot, right,
I don't know who invented this plot. So here's what we're going to do,
remember this prisoner's dilemma game. We've
got two players, there's Smoove and Curly.
- And what we're going to do is, there's
a bunch of joint actions that
they can take. So Smoove cooperates and
Curly defects, or they both defect, or they
both cooperate, or defect-cooperate
in the other direction. So what we're going to
do is for each of those joint outcomes,
each of those joint action choices. I'm going to
plot a dot, put a dot on a two-dimensional plot, where
this is the Smoove axis, and this is the Curly axis, okay.
- So cooperate-cooperate, remember from the
prisoner's dilemma payoffs, is minus
one, minus one. So I put a dot at minus one, minus one.
Defect-defect is at minus six, minus six. Cooperate-defect is at minus
nine, zero. And defect-cooperate is at zero, minus nine. So do
those four points make sense to you? Do you understand like the idea?
- They do.
- It may not be so obvious why we do, do it this
way, because as you have told us, the matrix is all you need.
- But this is really just representing the matrix in another form.
But it actually is sort of losing some information. Because these dots
don't tell us what the relationship is in terms of if, if
one player keeps the same action and the other player changes to
a different action. The matrix captures that but
this plot doesn't anymore, it's kind of washed out.
- I see.
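For readers following along in code, the two-player plot just described can be sketched as follows. This is a minimal illustration using the lecture's payoff numbers; the dictionary layout and function name are my own:

```python
# The four joint actions of the lecture's prisoner's dilemma, mapped to
# (Smooth payoff, Curly payoff) points for the two-player plot.
PAYOFFS = {
    ("C", "C"): (-1, -1),   # both cooperate
    ("D", "D"): (-6, -6),   # both defect
    ("C", "D"): (-9, 0),    # Smooth cooperates, Curly defects
    ("D", "C"): (0, -9),    # Smooth defects, Curly cooperates
}

def plot_points():
    """The four dots of the plot: Smooth's payoff on the x-axis,
    Curly's on the y-axis."""
    return sorted(PAYOFFS.values())

print(plot_points())  # [(-9, 0), (-6, -6), (-1, -1), (0, -9)]
```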
Repeated Games Quiz
All right. Now so just to kind of reinforce this
idea and also to let us think about it in
a slightly different direction, I want you to consider solving
the following quiz. Here are three payoffs that I've put little
boxes around. This one at minus 1, minus 1, this
one at minus 3, 0, and this one at minus
4, minus 4. And the question is, which of these
payoffs can be the average payoff for some joint strategy?
And this is in the repeated game setting, [CROSSTALK]
okay. So what we're imagining is these two players can
coordinate in any way they want. We're not talking
about trying to maximize reward or, or, being Nash or
anything like that. They're just going to execute some,
some pattern of strategies over an infinite run. We're going to
average the payoffs that the two players get and we'll
say, you know is it possible for them to adopt
a strategy so that the average per time step gets minus 3 for Smooth
and 0 for Curly? Yes or no? So check that box if that's possible. Can it be that
they both get minus one on average? If so, check this box. And can it be
that they both get minus 4 on average?
Check this box. Does that, does that make sense?
- It does make sense.
- All right, so I'll give you a chance to think it through and answer it.
Repeated Games Quiz
- Alright, so are any of these easy to answer right off the bat?
- Well, one of them is really easy to answer
right off the bat, and that's minus 1 minus 1.
- Because that's one of the points, so. [LAUGH]
- Yeah. So, the joint strategy to get the minus
one minus one would be for both players to cooperate, right?
Again, they're just, you know, they're just willing to do that
just for the sake of showing they can make that value.
- Right, and I think I know the answers for the other ones as well
- Alright, hit me.
- So here's how I'm going to talk
you through my reasoning, so that if at any point I'm wrong, you can
gently steer me away from embarrassing myself. Basically I was looking at these
points and I thought hm, they form a kind of a convex hull.
- And I thought well, surely that's just an accident of the numbers
and then I thought, oh wait, of course we're talking about averages here.
So that means that all sort of possibilities have to be inside the convex hull
of the outer points. So if I drew a line between those four points,
I would end up with all possible, achievable averages.
- What, achievable averages, how? How would
you achieve things inside the convex hull?
- I would appropriately average them. So in
particular, the first thing I'd notice is that minus three,
zero, is outside that.
- So, it can't be, it can't be something you can achieve.
- Right, there's no way to make this minus
three, zero, certainly not by choosing any of these
points, but also no combi, no convex combination, no
probabilistic combination of them, is going to do that either.
- So, this one is just right out.
- Right, so that leaves minus four, minus four, and it
seems pretty clear that's inside the convex hull, so there is some
combination of them that would work.
- That's right. Any, any sense of what it would be?
- I think 3 5ths D,D and 2 5ths
C,C. And I get that by noticing that D,D is minus
six, minus six, and C,C is minus one, minus one, and minus
four, minus four is three fifths of the way between there.
- Boom. Cool. Alright, so you're good with that?
- I'm good with that if you're good with that. It's your quiz.
- Yeah. I'm excited.
- Oh good.
- So the minus one,
minus one and the minus four, minus four are achievable and,
as you pointed out, there is a more
general result here having to do with the convex hull.
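That convex-hull reasoning can be checked mechanically. Below is a hedged sketch: the counterclockwise vertex list and the standard point-in-convex-polygon test are my own encoding of the lecture's picture, with the lecture's payoffs:

```python
# Feasible region = convex hull of the four joint payoffs,
# listed here in counterclockwise order.
HULL = [(-9, 0), (-6, -6), (0, -9), (-1, -1)]

def in_feasible_region(p):
    """True if payoff pair p lies inside (or on the boundary of) the hull:
    p must be on the left of every CCW edge."""
    for i in range(len(HULL)):
        ax, ay = HULL[i]
        bx, by = HULL[(i + 1) % len(HULL)]
        # cross product of edge (a->b) with (a->p); negative means outside
        if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) < 0:
            return False
    return True

print(in_feasible_region((-3, 0)))   # False: outside the hull
print(in_feasible_region((-1, -1)))  # True: a corner (both cooperate)
print(in_feasible_region((-4, -4)))  # True: mix 3/5 D,D with 2/5 C,C
```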
- Alright, so through the magic of computer graphics, I have a
slightly better depiction of this particular
region now. As you pointed out this,
this convex hull of the points is really important, and what it
represents is what we can call the feasible region: these are average
payoffs of some joint strategy. They may have to
collude to do this, and they may not be particularly happy
to do that. [LAUGH] But the fact of the matter is,
they can achieve, by working together, payoffs anywhere inside this region.
- Huh, so that's what my student Liam
always meant when he talked about feasible regions.
- Maybe so, I mean it could be that he meant any number of
other things like places he's willing to live when he goes to get a job.
- No, it was all game
theory stuff, I just never knew what he was
talking about, but now I do Michael, thanks to you.
- Sure, hey I'm, I'm happy to help. So
this is a really useful kind of geometric way
of picturing something that would otherwise be a little
bit harder to see, I think in the matrix form.
- All right, the next concept that we're going to
need to understand the folk theorem is the notion
of a minmax profile. So a minmax profile is
going to be a pair of payoffs, one for each
player. And the value for each player represents the payoff
that that player can achieve by defending itself from
a malicious adversary. So what do you suppose a
malicious adversary would mean in a game theory context?
- Someone who's desperately trying to hurt you.
- And what does hurt mean?
- Trying to give you the lowest score.
- Yeah, and what does that remind you of from your lesson?
- My grad students.
- You think they're malicious?
- It would explain a few things.
- Yeah, I don't think they're malicious.
- They're sweet.
Yeah, I know a lot of them and they're, they're wonderful people.
- Well, what it reminds me of is, they
are wonderful people. It reminds me of zero-sum games.
- Exactly. So you can imagine thinking about the game
that we're playing, now, no longer as being I get my payoff and
you get yours, but I get my payoff and you get the negative of
my payoff. So you don't, you don't really care about yourself anymore. All you
care about is hurting me. And that's, that's the idea of a malicious adversary.
- I have some ex-girlfriends like that.
- I'm so. Oh. [LAUGH].
It is. People do get into this mode sometimes. And that,
that's actually going to be important in understanding the folk theorem.
- So what
I'd like to do is figure out what the minmax profile is for this game. So this
is a very famous game theory example.
It sometimes goes by the name Battle of the Sexes.
- Is that what the B and the S stand for?
- Sometimes it stands for Bach and Stravinsky.
- Those are like composers, I think.
- You mean like the Backstreet Boys and Sting.
- Ahh. Alright, that works for me. So,
so, let me explain this story. It turns out that
Smooth and Curly actually got away, they didn't make any kind
of deal, they actually just figured out a way of
escaping from the jail. So they're, they're back out on the
streets again, and they decided that they'd like to celebrate
their freedom by going out to see a concert. And they
both decided that in advance, but what they didn't know was
which concerts were available. Once they escaped out into the world,
they couldn't communicate with each other, but they
discovered that there are in fact two concerts
in the city that night. The Backstreet Boys are playing, and Sting is playing.
- Now as it turns out, each of them is now going
to have to choose, whether to go to the concert with Backstreet
Boys or Sting, and they're choosing independently. Now, if they end up
going to different concerts, they're both going to be unhappy and get zeros.
- I see.
- If they end up at the same concert, then they're going to be happier,
but in fact, as it turns out, Smooth really likes
the Backstreet Boys and would prefer that they both end up
at the Backstreet Boy concert. But Curly really likes Sting
and would prefer that they end up at the Sting Concert.
- That's not realistic.
- Which part?
- The fact that I prefer the Backstreet Boys to Sting.
- What, what do you mean you? I mean this is Smooth. He's a criminal.
- Mm, that's a fine point that you
make there. There is no connection between these
characters and ourselves.
- Or any real-life characters, living, dead,
fictional, mathematical, or instructional.
- Any resemblance is purely coincidental.
- Yeah. I could switch these around if you'd prefer.
- No, no. I'll go with your fantasy.
- [LAUGH] All right. Yeah, I think that payoff matrix may look
something like we both have twos in the same place.
- But anyway, but let's say for the purposes of this example,
there's a little bit of a disagreement. Okay, so now what we
need to figure out is what the minmax profile is for this game.
- Alright, so that's going to be a pair of numbers.
- One number corresponds to the payoff for Curly and one number
corresponds to the payoff for Smooth. And it should be, the payoff
for Curly should be the payoff that Curly can guarantee himself even
if Smooth is trying to get him to have a low score.
- And vice versa. Smooth's score is going to be the score that it
can guarantee, he can guarantee himself even
if Curly is trying to minimize Smooth's score.
- All right.
- So let's do this as a quiz. So, I want you
to find the minmax profile for this game, this Bach-Stravinsky
game or Backstreet, Sting game, and put the, the number for
Curly in the first box and Smooth in the second box.
Alright, what've you got?
- I'm going to say it's one, one.
- Alright, how would you figure that out?
- Well first, I would ask you if that's correct. [LAUGH]
- [LAUGH] We'll see when you try to figure it out.
- Okay, so, the idea here is I'm trying to
figure out what Curly would get, if Smoove was out to
get, make certain that Curly got the worst possible value.
So, Curly gets to choose among rows,
and Smoove gets to choose among columns. So, if
I chose the B column as Smoove, then Curly could get
a one. Given the choice, Curly would go for
the first row instead of the second row, because
Curly would get a one. If I took the
S column, if Smoove took the S column,
then Curly would choose the second row, and would
get two. And one is lower than two. So,
Smoove would choose the first column, and Curly would have to take
the first row and would end up with a one. Does that make sense?
- Yes, but here's what you haven't figured out.
- What happens if Smoove... no, because, because
Smoove says, I'm always going to choose the Backstreet Boys.
- Then you're right, Curly should also choose that and at least coordinate.
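The pure-strategy reasoning just walked through can be sketched numerically. This is a minimal illustration assuming the matrix described in the lecture (Curly gets 1 on a B-B match, 2 on an S-S match, 0 on a mismatch); the variable names are mine:

```python
# Curly's payoffs in the Backstreet/Sting game, keyed by
# (Curly's row, Smoove's column).
CURLY = {("B", "B"): 1, ("B", "S"): 0, ("S", "B"): 0, ("S", "S"): 2}

pure_minmax = min(                          # malicious Smoove picks the worst column...
    max(CURLY[(row, col)] for row in "BS")  # ...given that Curly best-responds
    for col in "BS"
)
print(pure_minmax)  # 1: Smoove commits to B, Curly matches and gets 1
```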
- But what if Smoove is random say
half and half random between the two options.
- And I
don't know what's going to happen before then?
- And you don't know what's going to happen, sorry?
- I know, Curly knows, it's half
and half, but he doesn't know which one Smoove is choosing at any given time.
- I see. Then I would end up with one half and one.
For choosing B, the expected
score would be a half, and for choosing
S, the expected score would be one.
- Oh, I see. And then Curly would choose S and
get the one, on average.
- And that's still consistent with that?
- Can we work out what the what the worst case is?
- What do you mean?
- Well I feel like, we should solve it
like a zero-sum game, like we did in your lesson.
- Say that Smoove is actually choosing B with
probability X and S with probability 1 minus X.
- So I thought you were asking for like
a pure decision, I wasn't even thinking about mixed decisions.
- Yes, it could be a malicious and randomized adversary.
- [SOUND] So we are.
- [LAUGH] You said yuck [LAUGH].
- We, we are really talking about my ex-girlfriends.
Okay so you just do the math. Do the math.
- Alright, so if we, if Smoove chooses Backstreet Boys
with probability X and Sting with probability of one minus X.
- Then Curly, for choosing Backstreet Boys, will get X on average, and
for choosing Sting, will get two times one minus X on average. And the
useful point is going to be to discover when these are equal to each other.
- Setting X equal to two times one minus X gives X equals
2 3rds. So in fact, Smoove, by being malicious and
stochastic, can actually force things down to 2 3rds.
- Hm. And things being symmetric as they are, Curly can do the same.
- So, basically, Curly can behave in such a way, that even against a
malicious adversary, he could guarantee himself a score of 2 3rds.
- Yeah, a malicious possibly mixed adversary.
- That's right.
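The mixed-strategy calculation can be verified numerically. Here is a small sketch using the same payoffs (match on B gives Curly 1, match on S gives 2, mismatch gives 0); the grid search is my own way of finding Smoove's minimizing X rather than solving the equation symbolically:

```python
def curly_best_response_value(x):
    """Curly's expected payoff when Smoove plays B with probability x
    and Curly best-responds with a pure choice."""
    return max(1 * x,        # Curly picks B: scores 1 when Smoove picks B
               2 * (1 - x))  # Curly picks S: scores 2 when Smoove picks S

# A malicious Smoove picks x to minimize Curly's best-response value.
xs = [i / 10000 for i in range(10001)]
security_level = min(curly_best_response_value(x) for x in xs)
print(round(security_level, 3))  # 0.667 -- i.e., 2/3, as in the lecture
```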
- Okay. But one, one would be right if we
were sticking with pure strategies, but why would we do that?
- That's right. And if that's right, then you can do a
version of the folk theorem with pure strategies. In fact, there's
lots of different flavors of the folk theorem.
The one that we're going to focus on
is going to allow for these mixed strategies.
- But. In fact in general you could say you know,
no I kind of like the mixed strategies, let's just stick with that.
- No, I like the mixed strategies too, I just wasn't thinking about them.
- So, I want to point out that in fact the solution that you gave, I
think that actually does correspond to what is
usually called minmax, which is the pure strategy version.
- So the, [LAUGH] the minmax is, is in fact one, one.
The other concept is really important too, though. I think
it's sometimes called the security level profile. So instead of the
minmax profile, the security level profile. And that allows for
the possibility of mixed strategies. So that gets you down to
the 2 3rds, 2 3rds. I think you know, it turns
out that there's folk theorems that can be defined with with either
of those concepts. I prefer this one. But I do like this
name better [LAUGH]. So I, I apologize if that made things confusing.
- I'm not
confused now. I think in the end Michael,
the important thing is we were both right.
- Well, in the example that I'm going to do next, these two concepts line
up. So let's let's do that, and then we don't have to care. [LAUGH]
- I, I'm all for that. Let's do that.
Security Level Profile
So, here we are, back at the
prisoner's dilemma, again. You may recall this picture.
- Let's add to this, the minmax, or security level
profile. So, for prisoner's dilemma, what is the, the minmax?
- Isn't it d comma d?
- It is indeed, D comma D. Right, so this
is the value that you can guarantee yourself against a malicious adversary.
Malicious adversary is just going to defect on you, and the
best thing you can do in that case is defect yourself.
- Agreed, agreed?
- Alright, so now let's take a look at the intersection of the following two regions.
There's this nice yellow region that we've already got, and
then we've got a new region that's defined by this
minmax point. This minmax profile. So the region that is
above and to the right of this, of this minmax point.
- So, the, that's the region. This, this region, alright we
already said that this yellow region is called the feasible region.
- Mm-hm, or orange or whatever color it is.
I'm thinking we can call this other region
the acceptable region. And it, and, the, what I
mean by that is if you think about
it, payoffs in this region are Smooth getting more
than what Smooth can guarantee itself in an
adversarial situation, and Curly getting more than Curly could
guarantee himself in an adversarial situation. So these are
all, like, you know, better than things could otherwise be.
- So, why not call it the preferable region?
- Sure, we could call these preferable regions, preferable to not being in this region.
- The intersection of these two
is the feasible, preferable, acceptable region?
- [LAUGH] Exactly. It's kind of, you
know, special, from the perspective that it
is both feasible and preferable. And now we are ready to state the Folk Theorem.
So here's the Folk Theorem. Any feasible payoff
profile that strictly dominates the minmax or security
level profile can be realized as a Nash
equilibrium payoff profile, with a sufficiently large discount factor.
- Prove it.
- What we're going to do is we're going to construct a
way of behaving where both players are going to play
so that they achieve this feasible profile. And
to make that a Nash equilibrium, what we need to do
is make it so that it's a best response. And the
way that we are going to make it a best response is
we're going to say do what you're told. Follow, follow your instructions
to achieve that feasible payoff. And if you don't, then the
other player is going to attack you, is going to adopt
a strategy that forces you down to your minmax or security
level, and that's your threat. So the best response to that
threat is to just go along and, and and do what
you're told to achieve that feasible payoff. The only
way that that's going to be stable though is if the
thing that you're asked to do, the feasible payoff, is
better than the minmax, right? Because that has to be
a threat. You can't threaten somebody and say, you know,
do this or I'm going to give you candy. It's gotta
be, do this or I'm going to give you something
that's less pleasant than what I've asked you to do.
- Okay, that actually makes sense.
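The strict-dominance half of the theorem's condition is easy to state in code. A minimal sketch for this prisoner's dilemma, assuming the mutual-defection minmax point of (-6, -6); feasibility, i.e. membership in the hull, would be checked separately:

```python
# The minmax / security level point for the lecture's prisoner's dilemma:
# mutual defection guarantees each player -6.
MINMAX = (-6, -6)

def strictly_dominates_minmax(profile):
    """True if BOTH players do strictly better than their minmax value,
    which is the folk theorem's precondition (along with feasibility)."""
    return profile[0] > MINMAX[0] and profile[1] > MINMAX[1]

print(strictly_dominates_minmax((-1, -1)))  # True: mutual cooperation qualifies
print(strictly_dominates_minmax((-4, -4)))  # True: the D,D / C,C mixture qualifies
print(strictly_dominates_minmax((-6, -6)))  # False: not *strictly* better
```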
- Yeah, so this is, this is a really cool idea.
- I like it. Hey, could you try saying that again,
but with a Southern accent, just the Folk Theorem part?
- Any feasible payoff profile that
strictly dominates the minmax/security level profile can
be realized as a Nash equilibrium
payoff profile with sufficiently large discount factor.
- I like that because now it's a folksey theorem. [LAUGH]
So another way to think about the proof of the folk theorem, is you
could prove it with a little strategy that is referred to as grim trigger.
- Mm. I like that. I like the way it sounds.
- So here's, here's the basic structure of grim trigger. It says that
what we're going to do is we're going to start off taking some kind of action,
or pattern of actions, that's of mutual benefit. And as long as the
cooperation continues, this mutually beneficial behavior will
continue. However, if you ever cross
the line and fail to cooperate with me,
Then I will deal out vengeance against you forever.
- Hm, so once again, we're talking about my
ex girlfriends. I don't know why you're obsessed with this.
- [LAUGH] Maybe it's just, maybe it's just trying to help
you understand. So you know kind of what this situation is.
Anyways here's, here's what that looks like in the context of
prisoners dilemma. Alright so cooperation is
the, is the mutually beneficial action.
- And as long as you continue to
cooperate with me, that's this C arrow here, then
I will continue to cooperate with you, but if
you ever defect on me, I swear I will
spend the rest of my life making you pay. So no matter what you do at this
point, defect or cooperate, I will just continue to
defect on you. Pain will rain from the sky.
- Okay, well this makes sense. So, the idea here
is that if you know that I'm going to do this,
then hopefully it makes sense for us to continue to mutually benefit.
- Right. So the whole purpose of this is
to create a Nash equilibrium kind of situation.
Right? Where if I'm playing this strategy. And you're playing
this strategy, then neither of us has any incentive to
cross the line, and so we're just going to continue
to cooperate. Crossing the line is going to decrease your
reward, so there's no benefit to doing it, so you
won't do it. So it, it's nice because it gets
us a Nash Equilibrium.
- But there's a problem with it.
- Of course.
- And you pointed it out before, so let's, let's dive in and
make sure that we understand it and see if we can fix it.
The problem is that in some sense, the threat
is implausible. And it, and it, in a very
kind of real sense. So what's happened is that if you do defect on me, if you
do cross that line, the idea that I will
then spend the rest of my days punishing you,
forgoing reward myself, right? Not taking the best response
against you, seems kind of crazy. Do you agree?
- Once again, my ex-girlfriend. Yeah, I totally get this.
- No, but no, I'm saying that nobody would do that.
- Right, so it, it would be like being in an elevator with a stick of
dynamite and seeing that someone has a hundred
dollars and saying, give me your hundred dollars
or I'll blow us both up. That's not
really a reasonable threat because the alternative to
me not giving you a hundred dollars is,
you die, which seems probably not worth it.
- That's right, so you could think about
the possibility of, okay, I'm going to not
give you the $100, you say that you're going to blow me up,
but you will hurt yourself more than you hurt me, so it won't
be a best response. Not blowing me up and just not getting the
$100 and leaving the elevator is better for you than blowing me up.
- So that is an implausible threat. So the
way we formalize this idea, in the game theoretic
context, is to say that we're interested in plausible threats. A plausible
threat corresponds to something that's called a subgame perfect equilibrium.
- So subgame perfect means that each player is always taking a best
response, independent of the history. All right,
so let's actually look at a concrete
example here, let's imagine playing Grim Trigger,
against Tit for Tat. So my first
question to you is, are these two strategies in a Nash equilibrium with each other?
- Yeah I guess so.
- And why is that?
- Because, the fact, if I'm playing Tit,
if one player is playing Tit for Tat
then the Grim Trigger thing doesn't matter anyway because both of you
are going to cooperate forever and it doesn't make any sense to deviate.
- Right, so any strategy that I could choose that's different than
Grim Trigger is going to on average do no better, possibly worse.
- So I might as well stick with Grim Trigger
and Tit for Tat has the same kind of feeling about
it, that, it's cooperating with Grim. And any, it can't
really do anything better so it might as well do that.
- So the next question to ask is.
Are these two strategies in a subgame perfect equilibrium with
each other? And the way that you actually test that, is
you say, well, they are not subgame perfect if there's
some history of actions that we could feed to these machines,
so that, so that, you know, here's, here's what Grim
is doing. It's, it's some, some sequence of cooperates and defects,
and here's what Tit for Tat is doing. It's some sequence
of cooperates and defects. And once we've reached some particular point,
is it the case that one or the other of these
machines is not taking a best response, that it could actually change
its behavior away from what the machine says and do better than
what the machine says? If that is the case, then it's not
subgame perfect, but if, for all histories, they're
always taking a best response, then it is subgame perfect. So, so
do you see a, a sequence of, of moves that these two
players can take where one or the other of them is not
going to be doing a best response?
- It's, can take, right? As opposed to, will take.
- I don't understand.
- Yeah, I'm not sure I do either. That's
why I asked the question. It's not a, you know
made up history, it's like an actual set of
moves that are consistent with Grim and Tit for Tat.
- No no, no no, so it is, it is not necessarily. So we know that
if we actually play these against each other
the only history that we're going to see, is.
- Cooperate forever.
- Right, Grim
is going to do cooperate cooperate cooperate cooperate,
Tit for Tat is going to do cooperate cooperate
cooperate cooperate. And so they are, and
everything's fine. The question is, can we actually
go in and alter, the history, so that one or the other in the machines
could take a better action than the one that the machine tells it to take.
- Yeah, if Tit for Tat ever does defect.
- Alright, so let's take a look at that. So, let's say,
on the first move Grim cooperates and Tit for Tat defects. Okay, so let's
say that, that's the moment in time. What will the machines do at this point?
- Well, at this point, at the next time
step, Tit for Tat will cooperate and Grim will defect.
- Good, and then thereafter?
- Grim will always defect.
- And then Tit for Tat will always defect.
- So the
payoff that Grim gets at this point is
going to be, well, initially high but then very, very low.
- On the other hand could Grim have
changed its behavior to do better than this?
- Yeah, just by choosing to cooperate.
- By choosing to cooperate. So if it sort of
ignores the fact that Tit for Tat did the
defect, and instead does a cooperate here, then Grim would
do better. So the idea is that Grim is making
a threat, but when it comes time to
actually follow through on that threat, it's actually doing
something that is worse for itself than what
it would do otherwise. Do, do you see that?
- I do.
- So is it subgame perfect? No. And the proof
of that is exactly what you said: take
a look at this history. Here's a history where Grim
would not be willing to actually follow through on its threat.
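This argument can be checked by simulation. The sketch below is my own encoding of the two machines, using the lecture's payoffs: we force Tit for Tat to defect on move 1, then compare Grim's long-run average when it carries out its threat versus when it deviates to always cooperating:

```python
# Payoffs from the lecture: (C,C) = -1 each, (D,D) = -6 each,
# the lone defector gets 0 and the lone cooperator gets -9.
def payoff(me, other):
    return {("C", "C"): -1, ("D", "D"): -6,
            ("D", "C"): 0, ("C", "D"): -9}[(me, other)]

def simulate(grim_forgives, rounds=10000):
    """Average payoff to the Grim player after Tit for Tat's forced defect.
    If grim_forgives, Grim ignores the trigger and always cooperates."""
    grim_triggered = False
    grim_prev, tft_prev = "C", "D"       # the altered first move
    total = payoff(grim_prev, tft_prev)  # Grim gets -9 on move 1
    for _ in range(rounds - 1):
        tft_move = grim_prev             # Tit for Tat copies Grim's last move
        if tft_prev == "D":
            grim_triggered = True        # Grim's trigger has been pulled
        grim_move = "C" if grim_forgives else ("D" if grim_triggered else "C")
        total += payoff(grim_move, tft_move)
        grim_prev, tft_prev = grim_move, tft_move
    return total / rounds

print(round(simulate(grim_forgives=False), 2))  # -6.0: carrying out the threat
print(round(simulate(grim_forgives=True), 2))   # -1.0: forgiving does better
```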
- So it's an implausible threat, and that's
bad. So maybe we've now just undone all
the awesomeness that we had done.
- Well maybe. I mean the awesomeness was hey look we can actually
get machines that are in nash
equilibrium and they're cooperative in, in prisoner's
dilemma so they're actually kind of doing the right thing. And, turns out well
they are but they're, depending on this notion of implausible threats to do it.
- Mm, how should I feel about that?
- Well, let's see if we can fix it.
TfT vs TfT
All right. So let's make sure that we get this concept.
So let's evaluate, tit-for-tat versus tit-for-tat, spy versus spy, and ask
whether or not they are subgame perfect, or in a subgame
perfect equilibrium with each other. And your choices are yes and no.
- Mm hm.
- But if you say no, I'd like you to give me a sequence to
show that it is not subgame perfect. In other words, that if they were to take
this sequence of actions and this one was to take this sequence of actions,
it would leave the machines in a position where they would not be willing to
follow through on their threat. That it would be better for them to do something
else in the long run, assuming that
they're still playing against the other tit-for-tat machine.
- From that point on.
- Just imagine that someone could go in and change one thing.
Would you still want to follow tit-for-tat the next time step or not?
- Yeah, I don't know
if that has to be just one thing, but yeah. That's
right. Change, change the sequence leading up to this point and
then say, okay, do you still want to do what tit-for-tat is
telling you to do, or would you rather do something else?
- Right. Okay. I think I got it.
- Cool. All right. Go.
TfT vs TfT
Okay. What's your answer?
- My answer is no.
- No? I'm sorry, that's wrong. No, no, I think that, well, if
it's right, then we need to provide a sequence that proves it.
- So what are you thinking?
- Well, I was thinking actually something very similar to what we just saw.
- Where so tit for tat, what they're going to do is they're
going to do cooperate, cooperate, cooperate, cooperate,
right? That's what they normally want to do.
- So what would happen if at one
point, one of them defected?
- Okay, just for simplicity let's make it the very first point.
- So, tit for tat number one defects,
and tit for tat number two also defects.
- At the very first time? No, it
cooperates, because it's done at the same time.
- Well, I mean so we're, we can feed
it anything we want. So we could tell tit for
tat two to cooperate. So it's sort of like we've
taken over its brain for a brief amount of time.
- Right. So I'm not yet convinced it's going to matter
for this, but the thing is that from that point
on, tit for tat two is going to want to, for
the next step tit for tat two is going to want to defect.
- That's right.
- And, tit for tat one would want to cooperate.
- And that sort of... sucks.
- For tit-for-tat two, right?
- So, I don't know, so maybe we should try
to think through. What is the expected reward, for tit-for-tat
two, to actually do this defect at this time?
- Or wait, no, sorry: the expected reward
to stick with the tit-for-tat machine at this time.
- Well, what's going to happen at
that point is, it's going to keep alternating.
- That's exactly right. So it's going to get the, the rewards corresponding
to D versus C, C versus D, D versus C, C versus D.
- Over and over again.
- Yeah. So let's thi, let's think about it
in the average reward case. So, in the D versus
C, if it does D when the other machine does C, then it gets zero.
- If it does C when the other one does D it gets minus nine.
- And then this alternates. So if we look at the average reward,
which is basically what you get when the discount factor's very, very high.
- It's scoring negative 4.5.
- Is there any way, that it could behave against tit for
tat, starting from this point that would do better than negative 4.5.
- Just go ahead and cooperate.
- Just cooperate forever?
- Well cooperate the next time and then keep doing tit for
tat from that point on. It'll work out to be cooperate forever.
- On average. That's right. What it will get is a minus
one. So not being tit for tat at that point, but
instead turning to always cooperate, would actually do better. So the
idea that it should defect at this point is an implausible threat.
- So this is not
sub-game perfect. So yes, you nailed it.
- Does that make sense?
- It did make sense.
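The same kind of simulation makes this concrete. The sketch below is my own encoding of the quiz's scenario with the lecture's payoffs: after player one's forced defect on move 1, the two Tit for Tat machines lock into an alternating pattern averaging -4.5, while player two forgiving once restores mutual cooperation at -1 per round:

```python
# Payoffs from the lecture: (C,C) = -1 each, (D,D) = -6 each,
# the lone defector gets 0 and the lone cooperator gets -9.
def payoff(me, other):
    return {("C", "C"): -1, ("D", "D"): -6,
            ("D", "C"): 0, ("C", "D"): -9}[(me, other)]

def tft_vs_tft(p2_forgives, rounds=10000):
    """Average payoff to player 2 after player 1's forced defect on move 1."""
    p1_prev, p2_prev = "D", "C"          # the altered first move
    total = payoff(p2_prev, p1_prev)     # player 2 gets -9 on move 1
    for t in range(rounds - 1):
        p1_move = p2_prev                # plain Tit for Tat: copy the opponent
        # Player 2 copies too, unless it "forgives" on move 2 instead of
        # carrying out Tit for Tat's threatened defection.
        p2_move = "C" if (p2_forgives and t == 0) else p1_prev
        total += payoff(p2_move, p1_move)
        p1_prev, p2_prev = p1_move, p2_move
    return total / rounds

print(round(tft_vs_tft(False), 2))  # -4.5: endless D/C alternation
print(round(tft_vs_tft(True), 2))   # -1.0: forgiving restores cooperation
```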
- Good, alright. So that leaves open the question of, is
there a way to be subgame perfect in prisoner's dilemma?
- Can I ask you a question
before you answer that? So, I had sort
of convinced myself that it didn't matter whether tit for
tat number two started out with C or started out
with D. I'm trying to decide whether that's actually true.
- 'Kay, that's a
good question. So what will happen from
this point on if we now continue? We, you
know, we took over the brains of tit for
tat, and we forced them to play defect against defect.
- And now we release that, and we let them do what, whatever it is
that they're going to do. And what
are they going to do? [CROSSTALK]
- They would defect forever [CROSSTALK]
- Defect forever.
- And is there anything that tit-for-tat two machine,
could do to get a better score than that?
- Yeah, so it could cooperate. Cooperating with tit for
tat will bring it back into mutual cooperation.
- It will actually get a better score.
- So, in one case, it would average to minus one, and in the other one, it
would average to minus six, and
minus one is better, so you're right. Good point.
- So what matters is that we get them defecting.
- Right. Okay, so that makes sense. So, I, I was right that
it didn't matter, although you do get
slightly different answers, or slightly different averages.
- That's right.
But in both cases, there's a way of getting a higher average.
- Right. Okay, cool, that's what I thought, I thought it was
something like that. So now, let's go back to what you wanted
to do. So, are we going to be able to figure out how
to do Prisoner's Dilemma in a way that is subgame perfect?
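The average-reward figures in this exchange can be checked with a short simulation. This is a minimal sketch, not from the lecture itself, using the payoff values quoted in the discussion (mutual cooperation gives minus one each, mutual defection minus three each, and a disagreement gives minus nine to the cooperator and zero to the defector):

```python
# A minimal check of the average-reward figures above. Payoffs per
# round, inferred from the numbers quoted in the discussion:
# (C,C) -> -1 each, (D,D) -> -3 each, (C,D) -> -9 for the cooperator
# and 0 for the defector.
PAYOFF = {('C', 'C'): (-1, -1), ('C', 'D'): (-9, 0),
          ('D', 'C'): (0, -9), ('D', 'D'): (-3, -3)}

def average_reward(first_moves, rounds=10000):
    """Tit for tat vs tit for tat, with the first joint move forced;
    returns player one's average reward per round."""
    a, b = first_moves
    total = PAYOFF[(a, b)][0]
    for _ in range(rounds - 1):
        a, b = b, a                      # each repeats the other's last move
        total += PAYOFF[(a, b)][0]
    return total / rounds

print(average_reward(('C', 'D')))        # alternating C/D: -4.5
print(average_reward(('D', 'D')))        # mutual defection forever: -3.0
print(average_reward(('C', 'C')))        # mutual cooperation forever: -1.0
```

Switching out of the alternating pattern into permanent cooperation moves the average from minus 4.5 to minus one, which is exactly why defecting at that point is an implausible threat.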
- Well, how about I propose a machine, and we'll see what it does?
So here's a machine that is sometimes referred to
as Pavlov, named after the Russian psychologist who was
studying gastric juices and then figured out how animals
learn. So I don't know why it's called that
in this particular case but, here's what the machine
looks like. It says start off cooperating and as
long as the opponent keeps cooperating, then cooperate. So,
so far it sounds a lot like tit for tat.
- If you defect, then I move to the defect state. So,
okay, still looks like tit for tat. And
again, this defect state has two arrows coming
out of it, but they're reversed from what they were in tit for tat. Here it
says, if you cooperate with me, I will
continue to defect [LAUGH] against you. But if
you defect against me, then I'm willing to
cooperate with you. So, that's a little strange, right?
- That is a little strange, yes.
- So, you know, the way to get me to stop defecting against
you is to defect against me. And then I become cooperative again.
- So, in other words, take advantage of you
until you sort of, pull a trigger on me.
- Seems like it. Yeah.
- It's a funny thing. So again, tit for tat is like repeat
what the opponent did. Pavlov is basically cooperate
as long as we're agreeing. If we both
defect, I'll cooperate. If we both cooperate, I'll
cooperate. But if we disagree, if I defect
and you cooperate, or you cooperate and I defect, then I will defect
against you on the next round. That sort of makes sense.
- Yeah, right? I mean, it says: if you're cooperating, I'm going to
cooperate. We're both going to start out cooperating. And then if you ever
defect on me, then you have basically attacked me, and so I'm going
to defect on you. Unless you start cooperating again, in which case I think
that you're being reasonable, because of what I
did, and so now we're going to start cooperating again.
- Except that's tit for tat [LAUGH].
- No, you're right. You're right, I take
it back. It doesn't make any sense. Too many.
- Yeah, it's weird. It's a little bit weird.
- Too many V's. Too many V's.
- Too many V's?
- In Pavlov.
- Two V's. Yeah, but you can think of them being like arrowheads.
- Oh well then that makes much more sense.
- Yeah, exactly.
- Yeah, so it's this weird thing where I'm going
to continue to defect against you until you realize I'm hurting you, you punch
me once. [SOUND] And now okay, good, we're even again. We good? Yeah we're good.
- Oh this is like men on a basketball court.
- Yeah, sure but, now here's the question.
Is this Nash? So, we'll make that a quiz.
Pavlov v Pavlov Nash
So, here's a quiz. And this is, I chose it this way so that
I can maximize the number of v's in one sentence. Is Pavlov v. Pavlov, Nash?
- [LAUGH] I think I know the answer.
- Alright. Let's, let's give everybody else a chance to think about it.
- Okay. Go.
Pavlov v Pavlov Nash
- Alright, what'd you get?
- I say yes.
- And the answer, that comes from what?
- Well, we both start off cooperating. And so, you are always going to
cooperate and it doesn't make any sense
for anyone to ever move away from cooperation.
- Hence, you are at equilibrium, in particular, you are at Nash equilibrium.
- That is correct.
Pavlov is Subgame Perfect
So, that's very good. So, you were able to realize that Pavlov is
in Nash equilibrium. But we can go further than that. Or farther than that.
- We can go further than that, and show that, in fact, Pavlov is subgame
perfect. So, unlike tit for tat, it's
actually subgame perfect. So let's take a
look at how we could convince ourselves of that. So I think the important thing
to see is that by feeding these two
Pavlov machines different sequences, we can get into
any of these four different combinations of states:
they're both in the cooperation state, they're both
in the defect state, one's cooperating and one's
defecting, or one's defecting and one's cooperating.
- So is it the case that, no matter which states
they're in, the average reward is going to be that of
mutual cooperation? So let's check: if they are both cooperating, and we
continue with these Pavlov machines, then they will mutually cooperate, so yes.
Alright, if they're in defect-defect, then what's going to happen?
- Well then they both agree, so then they cooperate.
- So they're going to move to cooperate
and then they'll stay there forever, so
then that'll be mutual cooperation. Awesome. What
if ones in cooperation and ones in defect?
- Then they disagree.
- Right. And they move to the other state.
- More specifically what?
- Oh, I don't know [LAUGH], I can't remember. I'm trying to keep
track of who's who. So if I cooperate and you defect, then let's
see: the guy who cooperates moves to defect, and the guy who defects
moves to defect. And now
you agree, and so you're going to cooperate.
- Boom. So, when we're in the
cooperate defect state, then on the first move.
Let's see, you just, yeah, the right-hand
Pavlov just defected, so that causes this transition.
And this guy just cooperated, which causes this
transition, so that we've gone to defect defect.
- Right. Which means that
we're going to average cooperation, because [CROSSTALK]
- That's where we're going to get stuck in the long run.
- And the same thing works through here. Boom.
- That's actually very cool, and kind of counter-intuitive.
- Yeah. And truly neat. So sort of no matter what
weird sequence we've been fed we manage to
resynchronize and then return to a mutually cooperative state.
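That resynchronization is easy to check mechanically. Here is a minimal sketch, with names of my own choosing, that runs Pavlov against Pavlov from each of the four joint states:

```python
def pavlov_next(mine, theirs):
    """Pavlov: cooperate after agreement, defect after disagreement."""
    return 'C' if mine == theirs else 'D'

def play(a, b, rounds=10):
    """Run Pavlov vs Pavlov from a forced joint state; return the trace."""
    trace = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov_next(a, b), pavlov_next(b, a)
        trace.append((a, b))
    return trace

# From any of the four starting combinations, the pair settles
# into mutual cooperation within two rounds.
for start in [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]:
    print(start, '->', play(*start)[-1])   # each ends at ('C', 'C')
```

Note that a disagreement sends both machines to defect, and the resulting agreement sends both back to cooperate, which is the resynchronization argument in miniature.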
- So I have a question for you.
- Go for it.
- So presumably this is really cool, like mathematically, because now we should
all be Pavlovians. Like we're all Keynesians. And then we
just kind of move forward from there. Do people do this?
- I don't know the answer to that question.
- I mean, other than men on a basketball court.
- Sure, you can always return to that. Though I'm not sure I'm aware of any
analyses of men on a basketball court, or
whether or not, you know, people have analyzed that.
- But how about this: if I find
out, I will post something in the instructor's comments.
- Okay, that sounds reasonable. So Pavlov
is subgame perfect. That's awesome. So remind
me again why I care that something is subgame perfect?
- Because it means that, so let's say I'm being this left Pavlov,
and you defect on me, and you're like, yeah, I'm going to
defect on you because I just want to take advantage of you,
and you're going to forgive me and I will have gotten this
extra bonus for that. What it turns out is that, no,
if we do Pavlov versus Pavlov, we're going to fall into mutual cooperation
no matter what. So this defection that I do, this
threat, this punishment that I deal out to you, you can earn
it back, and we can go back into a cooperative state.
- So, it's worth it to me to punish
you, because I know that it's not going to cost
me anything in the long run, and it stabilizes your
behavior in the short run. Sure. It makes perfect sense.
- So it becomes a plausible threat.
- I like it.
Computational Folk Theorem
So this Pavlov idea is actually more general than just the prisoner's
dilemma or iterated prisoner's dilemma. And in fact, it led to a result that
I like to call the computational folk theorem. The idea of the
computational folk theorem is: you give me any two-player bimatrix game.
- What's a bimatrix game?
- Just that there's two players. [LAUGH]
- Okay. [LAUGH]
- So it seems kind of redundant, doesn't it?
- It does.
- What makes it bimatrix, as opposed to
a two-player zero-sum game, which you can
write down with a single matrix, is that
each player has its own reward matrix.
- I see.
- But you're right, I could've just said bimatrix game and
left out the two-player. And it's
an average reward repeated game. So we're going to
play round after round after round,
and we're going to look at the average reward.
Or, you can also think of it
as discounted with an extremely high discount factor.
You give me one of those games, and what I
can do is build a Pavlov-like machine for
any of these games, and use that to construct a
subgame-perfect Nash equilibrium, for any of these games, in polynomial time.
- And the way that this works is: if it is possible for us
to have some kind of mutually beneficial relationship,
then I can build a Pavlov-like machine, quickly.
If not, the game is actually zero-sum-like, right? Because
in a zero-sum game we can't mutually benefit, so we can't
do anything like Pavlov; we're just going to beat each other up. So
we can actually solve a linear program in polynomial time, and work out
what the strategies would be if we were playing a zero-sum-like
game. And so either that works, and produces a Nash equilibrium,
and we can test that. Or it doesn't work, but then at most
one player can improve its behavior, and by taking that best response
against what the other player does in a zero-sum-like sense,
that will be a Nash equilibrium. So there are three possible forms
of the Nash equilibrium, but we can tick through these, figure out
which one is right, and derive the actual strategies in polynomial time.
- Wow, that's pretty impressive, who came up with this idea?
- So this is a result due to Peter Stone and somebody, Oh yeah me.
- Oh, well that's very impressive,
so you managed to find a way to sneak in some of your own work into this class?
- Here, let's do some more of that.
- Okay, I'm a big fan. And I think that's
fair because I did that way back when with MIMIC.
- MIMIC. So, yeah, that's all I really
wanted to say about the folk theorem and repeated
games. What I'd like to do now is move to stochastic games,
which are a generalization of repeated games, and talk a little bit about
how this relates back to things like Q-learning and MDPs.
- Oh, okay. That sounds cool, almost sounds like you're wrapping up.
- It is, yes. And that will be the end of the new material.
- Wow. Well that means we're coming towards the end of the entire course.
- I know. We're all going to cry with disappointment.
And I just say this,
you know, as an idle suggestion, that
the students should demand that we teach more classes.
- I concur. So let's get there so that they can demand.
Stochastic Games and Multiagent RL
So what I would like to tell you about is a generalization of both MDPs and
repeated games, that is, that goes by the
name of Stochastic games, also sometimes Markov games.
- I like the name Markov game better, but I used stochastic game because
that's what people call it, and sometimes
it's good to use words that other people
use. And what stochastic games give us is a formal model for multiagent
reinforcement learning. In fact, I like to think of this in terms of an analogy,
which is something like: MDP is to RL
as stochastic game is to multiagent RL. It's a
formal model that lets us express the sorts of
problems that take place in this formalized problem setting.
- Hm. That sounds very promising.
- Cool. Alright, so let me,
I'll start off by explaining it in
terms of an example, and then I'll give a
more formal definition because, you know, I can't not.
So this is a little game played between A and B. Oh, I should
have made it between Smooth and Curly, but
traditionally it's played between A and B.
- Mm, and sometimes it's good to use the words that other people use.
- [LAUGH] I've heard that. I wouldn't say it quite
that way. So this is a three-by-three grid;
each of the players can go north, south, east, and
west, and can stay put if that's helpful. And the
transitions are deterministic, except through
these walls here, which are called semi-walls.
- So these thick lines represent walls that
you can't go through; the thin lines
just represent cell boundaries. But this kind of
dashed line here is a semi-wall, and that means
if you try to go through it, say if A goes north
from this position, then with 50% probability A will
actually go to the next state, and with 50% probability
A will stay where A is. So, the goal is to get to
the dollar sign. And if you get to the dollar sign, you get
a hundred dollars. So if we ignore A for a second, what should
B do to minimize the number of steps necessary to get the reward?
- Go left, and then go up and go up. Oh,
I'm sorry. Go west, and then go north and then go north.
- Yeah, and what should A do ignoring B?
- Go east and then go north and then go north.
- Unfortunately, these guys live in the world together, and
what happens is, they can't occupy the same square. And as soon as
somebody reaches the dollar sign, the game ends, and the other
player, if the other player hasn't reached the dollar sign, gets nothing.
- I see.
- So now there's a little bit of contention.
- So what happens if A and B both try to go,
to the same square at the same time?
- Let's say that we flip a coin, and one of them gets to
go first, and then the other one will bounce off of the first one.
- But that's not a problem when it comes to reaching the money.
- But it's not a problem, yes, right, so
the money is kind of like a money pit.
- [LAUGH] I don't think that's what a money pit is, but okay.
- And so they can dive in and they
both get the money, because they're in the money pit.
- I like it.
- So what do you do if you're A?
How do you play this game? Oh, let's think of another thing first. Can you think
of what it might mean to have a Nash equilibrium in a game like this?
- Oh, that's an interesting question. It would
mean, well, what would it mean? It would mean
that neither one of them would want to deviate.
- It would mean a pair of strategies for the two players.
Now the strategies are multi-step
things; they're like policies, right?
- Right, it's a pair of policies, such that neither would
prefer to switch. So can you think of
a pair of policies that would have that property?
- Well, I'm not sure. I was trying
to think about that. I was thinking that,
if I were a nice guy, what I would want is for us both
to try to go through the semi-walls,
and if we both get through the semi-walls, we just
go up again and then we hit the
dollar sign at the same time. And that's very nice.
- Okay, good. So that
seems like a cooperative kind of strategy, right?
Where, 25% of the time,
both will go to the goal together. Hooray. But...
- 25% of the time neither one will get through and
then we're in the same place we were before, so that's okay.
- That's right.
- The problem is the other 50% where one
of them gets through and the other one doesn't.
- Right, so what do you do if
you make it through and the other one doesn't?
- What do I do, if I get through, and
the other one doesn't? Well if I am only going to
do this the one time then I just keep
going and get the dollar, and the other person loses.
- Yeah, alright. So what this works out to
be, is that A is going to get to the
goal two-thirds of the time, and B is going
to get to the goal two-thirds of the time.
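The two-thirds figure can be derived directly. This is a sketch under the assumptions stated in the dialogue: each player gets through its semi-wall independently with probability one-half, and the pair simply tries again whenever neither gets through.

```python
# Each round both players attempt their semi-walls. Four equally
# likely outcomes: both pass, only A passes, only B passes, neither
# passes (in which case they retry, so we condition on the rest).
p_both, p_a_only, p_b_only = 0.25, 0.25, 0.25
p_stop = p_both + p_a_only + p_b_only       # 0.75: the attempt resolves
p_a_reaches = (p_both + p_a_only) / p_stop  # A reaches the goal
print(p_a_reaches)                          # 2/3, matching the dialogue
```

By symmetry, B reaches the goal with the same probability, two-thirds.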
- So, alright, so if that's the case, if
I say, okay, A, that's what you should do, B,
that's what you should do. Then is there a
way that either A or B can switch strategies and
- Well, if B, for example, decides to go west and then go up, what happens?
- Yes, that's a good question. B will now make it to the
goal 100% of the time, and A will only make it to the
goal 50% of the time. So B has an incentive to switch to
this strategy if we tell them to both go through the semi-wall.
- So that wasn't a Nash equilibrium. B would want to switch to this new policy.
- Is this a Nash Equilibrium?
- No. Wait, is it? No. Because, why doesn't A just choose to go east instead?
- Well, would, would A do better on average by switching to this strategy?
- Well, let's see. No, actually. Oh, no,
no, you said half the time they go through.
- So half the time you flip a coin. So half the time I don't make it.
- But half the time I do.
Actually, it looks the same.
- It looks the same. That's right.
- And B would go from one to one-half.
- Yeah, that's true. [LAUGH] So, it, A doesn't have an incentive to
do it, but B is hoping very much that A doesn't do that.
- So, yeah. So there's one Nash equilibrium
where B takes the center, another one where A takes the center.
I guess if we do this coin
flip thing, it works out this way. If
we change the rules here, so
that if they collide, neither of them gets to go, then
both trying to go to the center is not a Nash equilibrium
anymore, because you can do better by actually going up the semi-wall.
- Right. And so if we, if, if collision means nobody
goes through, then, suddenly, you'd want to do the other thing.
- Or one of you goes through the semi-wall and one goes the direct way.
So we can see that there's a bunch
of different Nash equilibrium here, sorry, Nash equilibria
here. And it's not so obvious how you'd find them, but it is at least
clear that they exist, and they have a
different form than what we had before, because
they're now policies, instead of the otherwise simplified,
just, you know, choose this row of the matrix.
- Cool. Alright. So let's think about how
we might learn in these kinds of environments.
- Oh, okay, I like that already.
So, stochastic games, originally due to Shapley, have a bunch
of different quantities: states, actions,
transitions, rewards, and a discount factor. And
here's how we're going to do it. We're going to
say that little s is one of the states.
And actions could be like little a, but actually, since
we're going to focus mostly on two-player games for the moment,
I'm going to write actions as a and b, where a
is an action for player one and b is an action
for player two. Sound okay?
- Alright. So next we have the transition function.
So, the transition function says: If you're in some
state, s, and there's some joint action that's taken,
like all the players choose their action simultaneously, a comma
b, then what's the probability of reaching some next
state s prime? And we can write rewards the same
way. So there's reward for player one, given that
we're in state s and there's a joint action ab.
And there's the reward for the other player, the second player.
And a discount factor is you know, like a discount factor.
- Totally makes sense. So, oh, it's the same discount factor for everyone?
- Yes! A good point. One need not define things
that way, but in fact that is the way it's always defined.
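As a concrete way to hold the quantities just listed, here is one minimal sketch of the model; the class and field names are my own choices, not notation from the lecture:

```python
# A minimal container for a two-player stochastic game: states,
# per-player actions, transitions, per-player rewards, and a
# discount factor shared by both players.
from dataclasses import dataclass, field

@dataclass
class StochasticGame:
    states: list          # the set of states s
    actions1: list        # actions a available to player one
    actions2: list        # actions b available to player two
    gamma: float          # discount factor, shared by both players
    T: dict = field(default_factory=dict)   # T[(s, a, b)][s'] = transition prob
    R1: dict = field(default_factory=dict)  # R1[(s, a, b)] = reward to player one
    R2: dict = field(default_factory=dict)  # R2[(s, a, b)] = reward to player two
```

A repeated game is then just an instance with a single state, and an MDP is an instance where T and R1 ignore the second player's action.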
- Hey, not to go off on a tangent here, but sometimes I see
MDPs and things like stochastic games defined
with a discount factor being a part
of the definition, and sometimes not. Like, it's just a part of
the definition of the problem, or sometimes it's a parameter of an
algorithm. Which do you prefer? Why do you have
the discount factor actually listed as part of the definition of the game?
- I have no justification other than it's nice to
have listed the things that might be important, as opposed to,
you know, working through algorithms for a while and then saying,
oh yeah, there's this other number that kind of matters too.
- That's fair. I was just curious.
- So one of the things that I
actually find really interesting is that this model was laid out by Shapley.
- One of the, like, a former Nobel prize winner. I guess
once you're a Nobel prize winner, you're a Nobel prize winner forever.
- Yeah, [INAUDIBLE].
- So, Shapley, the Nobel prize winner, and
as we're going to see in a moment, this
model is actually a generalization of MDPs, but
Shapley published it before Bellman published about MDPs.
- Oh. This is pre-Bellman.
- So MDPs, to some extent, can be thought of
as a narrowing of the definition of a stochastic game.
- So, all right. Let's do a little quiz. And see that we
really understand the relationship between this model
and other things that we've talked about.
Models and Stochastic Games
- Alright so, Stochastic Games are more general than
other models that we've talked about. And so, just to
make that case here's a way of making the Stochastic
Game settings more constrained. And, by making them more constrained,
actually turning them into other models that we've talked
about or could talk about. So, I wrote down three
different ways of constraining the Stochastic Game model. One says
that we're going to make the reward function for one player
the opposite of the reward function for the other. The next one says that the
transition function has the property that for any state and joint action and
next state. If that's going to be equal to the transition probability for state
joint action, next state, where we've changed
potentially the choice of action for player two.
- So basically, player two doesn't matter
to the transitions or the rewards for player one, and the rewards
for player two are always zero. So again, you can specify
this as a stochastic game. And then in the third case we
are saying that the number of states in the environment is exactly one.
- So I claim that by doing these restrictions, we get out the
Markov decision process model, the zero-sum
stochastic game model, and the repeated
game model that we've been talking about in
the context of, like, the folk theorem. So, what
I'd like you to do is write the letters A, B, and C in the correct boxes.
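For reference, the three restrictions described above can be written out (in the T, R notation from earlier; this is my reconstruction of the slide, not a verbatim copy):

```latex
\begin{align*}
\text{(A)}\quad & R_1(s,(a,b)) = -R_2(s,(a,b)) \\
\text{(B)}\quad & T(s,(a,b),s') = T(s,(a,b'),s') \ \text{for all } b, b',
                  \qquad R_2(s,(a,b)) = 0 \\
\text{(C)}\quad & |S| = 1
\end{align*}
```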
Models and Stochastic Games
Alright, talk me through it.
- Okay, so I'm going to say that I think I know the answers for this
one. And let's start with the first one.
So R1 equals minus R2, which you'll notice
they're equal and opposite. And in fact if you add them up, that is you sum
them you end up with zero. So I'm
going to say that's a zero sum stochastic game.
- For two, basically you're saying that, for all intents
and purposes, there's only one agent, which just makes it
a regular Markov decision process.
- Yeah. So isn't that interesting? That
just by making the other player irrelevant,
that's what an MDP is. It's like a game where the other players are irrelevant.
- Yeah, which, both of my children are like that.
But okay, I think that's pretty cool. And in fact, I'd
be right in saying that R2 doesn't have to equal
zero, as long as it just equals some constant.
- Yeah, that's, I mean, constant. Actually, depending on
how you think about it, we could just ignore
the whole R2 thing and just say that, as far as the
first player is concerned, since the second player really has no impact
on anything, it doesn't matter. But the reason I put that in
is I got kind of scared. Like, I feel like if I
lived my life and knew that my actions affected the state and
my rewards, but they were also affecting the rewards of somebody who
didn't matter, like, I feel like that would actually still have an influence
on me.
- Sure, but then the way you get around that is
you would say, well, your R1 is actually equal to your R2.
- [CROSSTALK] It would somehow [UNKNOWN].
- So, if I had done it like that, wouldn't
that be the case then that we're saying, oh yeah,
I see, that the second player is irrelevant, but the
first player's reward may be relevant to both?
- Yeah, okay, yeah I like that a little bit better. Yeah,
I mean, once again, it all boils down to changing the rewards. Okay,
and so given A and B, I know the answer to
three must be C, unless you're tricking me and it could be
A or B again, which I suppose you could have
done; you didn't say they were mutually exclusive. So let me actually
argue why it would be C. Well, there is only one
state, and since you're in a stochastic game, you're going to
be continually doing this. It means that you're basically doing the same
game over and over and over again, so it's a repeated game.
- Yeah, so in particular the actions impact the rewards,
but they're not going to impact the transitions,
because you're always just going back to
where you were. The discount factor plays the
role of deciding when the game's going
to end, stochastically. And so, yeah, it's exactly
a repeated game. This is the one I
feel most comfortable about, because this really does
recover that same model we've been talking about.
- I like it.
- Cool! Now, given that we are now
in a model that generalizes MDPs, it would be
nice if we could generalize some of the things
that we did with MDPs, like Q-learning and
value iteration, to this more general setting.
So, that's what we're going to try next.
Zero Sum Stochastic Games
Now what makes stochastic games more interesting, perhaps, than repeated
games, is the idea that the actions that the players take
impact not just the rewards, but also future states. Right? And
so this is the same issue that comes up when we're
talking about Markov decision processes, and the way we
dealt with it in that setting was by defining a value
function. So, it seems pretty reasonable to try to go after
that same idea again. So what I've got here is actually
the Bellman equation. And let's look at this together and let's
see if we can fix it, because it's not quite right
- For dealing with the idea of zero sum in a stochastic
game. Okay, so you remember the Bellman equation? We've got Q i star.
- So there was no i before. Q star was
defined over state-action pairs, so here
we're going to define it over joint actions
- For the two players, action pairs. The immediate reward to player
i for that joint action in that
state, plus the discounted expected value of the next state.
So we need to factor in the transition probabilities. So
the probability of actually going to some next state s
prime is T of s, a, b, s
prime. Right? So now, we're imagining what happens when we
land in s prime. So what I've written here says,
well, we're going to basically look at the Q values
in the state that we landed in, and kind
of summarize them, summarize that value back up, so
that we can use it to define the
value of the state that we left. You with me?
- I am with you.
- Alright, so if we say the
way we're going to summarize the value for the new state that we land
in is we think of it as actually a matrix game, that there are
payoffs for each of the action choices a prime and b prime. And over all
of those, we need to kind of summarize, well, which of those values in this
table that we get for s prime
are we going to propagate forward
and call the value of that state? So what we did in regular MDPs is we
said we'll take a max over all the
actions, or in this case all the joint actions.
- Uh-huh. So what do you think that translates to?
- Well, you wrote down max, but that doesn't make sense;
that can't be right.
- Well it translates, it means something. It
just doesn't mean what we mean it to mean.
- That's true, that's fair.
- So what does it mean and then, and then, how can
we fix it? So let's start off with what does it mean.
- It means that you kind of always assume
that the joint actions that are going to be taken
will benefit you the most. So, everyone is
trying to make you happy. So, this makes you optimistic?
- Yeah, sort of optimistic
to the point of, of...
- Yes, very good. Right, it basically says that whenever we're in a
state, the whole world is going to choose
their actions to benefit me, and this is
not what we get in, say, a zero-sum stochastic game. In a zero-sum
stochastic game, we should be, you know, fighting it out at this point.
- So that would work out if everybody's rewards were the same,
or everybody's rewards were the sum
of everyone's rewards, or something like that.
- That's right.
If it was some kind of team-based game.
- Or if everybody was, you know, going to sacrifice their
own happiness for the benefit of Q i, or player i, I mean.
- Hm. So it's not reasonable to assume that. In
fact, what was it that we were assuming when we
had a zero sum game that was just a single
stage? Right? Just a single game and then we were done.
- Oh, that people were doing minimax.
- And maximin.
- So what if we changed the equation to look like that?
So, what I mean by this is, when we
evaluate the value of a state, we actually solve the
zero-sum game in the Q values and take that value
and use it in the context of this equation.
- That seems closer to right.
- Yeah. I mean, it's not an unreasonable
thing to do. It's just to say, I'm going to
summarize the future by imagining that we're going to play
that game that represents, you know, all the future.
- And I'm going to act in such a way as to
try to maximize that, assuming you're trying to minimize
it, which makes perfect sense if it's a zero-sum game.
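Written out, the modified equation would read as follows; this is my reconstruction of the slide's notation from the surrounding discussion:

```latex
Q_i^*(s,(a,b)) = R_i(s,(a,b)) + \gamma \sum_{s'} T(s,(a,b),s')\,
\operatorname*{minimax}_{(a',b')} Q_i^*(s',(a',b'))
```

where the minimax term is the value of the zero-sum matrix game whose payoffs are the Q values at the next state s prime.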
- Right. I was, yeah, and we're still, we're
still acting as if there are only two people here.
- Yeah, yeah, that's right. It turns out
that when you're talking about zero-sum it really
implies that there's only two players. Because if
you have a zero-sum three player game, it
really is just a general sum game. You can imagine that the third player is just
an extra factor that's just messing with the
numbers to make things sum up to zero.
So, yeah, so zero sum really does kind of focus on this two player setting.
- That makes sense.
- So we've got this modified Bellman equation,
and we can even translate it into a form
that's like Q-learning. So the analog of
the Bellman equation and the Q-learning update in
this setting would be that, if we're
in some state, there's some joint action that's taken,
there's some pair of rewards that comes, and
some next state that's visited, then the Q value
for that state, joint action pair, is going to be updated to be closer
to the reward for player i plus the discounted expected value, or sorry, the
discounted summarized value, or V value, of the new state s prime. And
again, we'll use minimax to summarize what the values are in that new state.
- I like it.
- And that equation is sometimes referred
to as minimax-Q, because it's like
the Q-learning update but just with the minimax operator instead of a max.
- That makes sense.
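As a sketch of what that update might look like in code: this is an illustrative implementation, not the lecture's own, and it assumes scipy is available to solve the minimax linear program.

```python
# Minimax-Q sketch: summarize the next state's Q matrix with the
# value of the zero-sum matrix game it defines, solved by LP.
import numpy as np
from scipy.optimize import linprog

def minimax_value(M):
    """Value of the zero-sum matrix game M (row player maximizes)."""
    n_rows, n_cols = M.shape
    # variables: pi_1..pi_n (row player's mixed strategy), then v
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                     # maximize v
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])   # v <= pi^T M[:, b]
    b_ub = np.zeros(n_cols)
    A_eq = np.ones((1, n_rows + 1))
    A_eq[0, -1] = 0.0                                # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n_rows + [(None, None)])
    return res.x[-1]

def minimax_q_update(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.9):
    """One minimax-Q step; Q[s] is player i's |A| x |B| payoff array."""
    target = r + gamma * minimax_value(Q[s_next])
    Q[s][a, b] += alpha * (target - Q[s][a, b])

print(minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))  # matching pennies: 0
```

The only change from ordinary Q-learning is that the max over next-state actions has been replaced by the linear-program minimax.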
Zero Sum Stochastic Games Two
So, if we set things up this way, we actually
get some wonderful properties coming out. So here are some
things we know about this set up for zero-sum stochastic
games. Value iteration works, so we can actually solve this
system of equations by using the value iteration trick, which
is to say, we initialize these Q values to whatever
and then we just iterate this as an assignment, right,
we just say, you know, equals. So value iteration works.
This minimax-Q algorithm converges under the same kinds
of conditions under which Q-learning converges, so we get this
nice, you know, Q-learning analogue in this multi-agent
setting. The Q star that's defined by these equations is
unique, so we iterate it and we find it
and it's just, there's just that one answer. The policies
for the two players can be computed independently, that
is to say, if two different players are running minimax
Q on their own and not really coordinating with each other except for by
playing the game, that the policies that
they get out will actually converge to minimax
optimal policies. So it really does solve the zero sum game, which is maybe not
so surprising because, you know, they are
trying to kill each other after all. [LAUGH]
- So, the idea that they'd have to
collaborate to do that efficiently would be weird.
- [LAUGH] I never thought about it
like that, but, yeah, that would be weird.
- This update that
I've written here can be computed efficiently, which is to say
in polynomial time, because this minimax
can be computed using linear programming.
- Yes of course.
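As a sketch of what that linear program looks like: the row player maximizes a value v subject to v being no more than the expected payoff against each of the opponent's pure actions. The formulation is the standard one, but the code layout is just illustrative, and it assumes SciPy's `linprog` is available.

```python
from scipy.optimize import linprog

def game_value_lp(M):
    """Minimax value of a zero-sum matrix game via linear programming.
    M[i][j] is the row player's payoff. Variables are the row player's
    mixed strategy p[0..m-1] plus the game value v; we maximize v subject
    to v <= sum_i p[i] * M[i][j] for every column j."""
    m, n = len(M), len(M[0])
    c = [0.0] * m + [-1.0]                     # linprog minimizes, so minimize -v
    # v - sum_i p[i] * M[i][j] <= 0 for each column j
    A_ub = [[-M[i][j] for i in range(m)] + [1.0] for j in range(n)]
    b_ub = [0.0] * n
    A_eq = [[1.0] * m + [0.0]]                 # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0.0, 1.0)] * m + [(None, None)] # v itself is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]
```

For matching pennies, `game_value_lp([[1, -1], [-1, 1]])` returns 0, with the optimal mixed strategy splitting evenly between the two actions.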
- And, finally, if we actually iterate this Q
equation and, and it's converging to Q star, knowing Q
star is enough to figure out how to behave optimally.
So we can convert these Q values into an actual
behavior, again, by using the, the solution in the linear program.
- So it's just like MDPs, with value iteration and Q-learning?
- Exactly. It's like we've gone to, to
a second agent and it really hasn't impacted things
negatively at all. This is, this is all
the, pretty much all the things that we want,
come out. There, there are some things that
don't come out. For example, in the case of
an MDP, we can solve this system
of linear equations in polynomial time. Not just by
value iteration, but we can actually set it up as a single linear
program and solve it and be done in linear time or, sorry, not linear
time, polynomial time. This is not known to be true in the zero-sum
stochastic game case, it's not known whether
it can be solved in polynomial time.
- So there, it is a little harder as a problem, but it's, you know, not
harder, not deeply harder and not harder in
a way that matters in a machine learning setting.
- So this is really great. So let's,
let's try to take this same approach and
see if we can deal with general sum games.
General Sum Games
- So okay, so let's think about General sum games, so not zero sum any
more. But we're not, you know, restricted,
it could be any kind of relationship between
the two players. And so the first thing we need to do is realize well,
well we can't really do minimax here
any more. Right, because that doesn't make sense.
- Right. That only works with zero-sum games.
- Well it's only, yeah. That's, well, it sort of assumes that the
other player's trying to minimize my reward and that's not the, that's not the
concept of the Nash equilibrium. We'd like to
do something analogous and find a Nash equilibrium in
this general sum setting. So what, what operator do
you think we would need in this context here?
- Nash equilibrium?
- Yeah, so that would be a very reasonable thing to
do: instead of computing minimax, we take the
two-player matrix game, right, defined by Q1 and Q2, compute the Nash
equilibrium of that, and propagate that value back. It's a well-defined
notion, right, that we can summarize the value of these two payoff matrices
with a pair of numbers, which are the values of the Nash equilibrium.
- Alright, so far so good. So we can
do the same thing in the Q-learning setting:
substitute in a Nash equilibrium. And we can call
that algorithm Nash-Q, which appears in the literature.
- Oh minimax Q by the way is something
that I wrote about. Nash-Q is a different algorithm.
- So it's not as cool, is what you're saying.
- Well, let's, let's see how it goes. So this is
now an algorithm. You can actually, well, with this set of equations it's
not exactly clear what it means, but we can think about turning
that into value iteration, right? By turning this into an assignment statement.
- So, what happens when we run value iteration?
- So, yeah, so if you, you repeat
this over and over again, things, weird things
can happen, it doesn't, it doesn't really converge,
it doesn't really solve this system of equations necessarily.
- And unfortunately, the reasoning here is even
harder in the case of Nash-Q, because in the case
of Nash-Q, it's really trying to solve this system of
equations using something like value iteration, but with extra stochasticity.
And so it also suffers the same problem.
It doesn't necessarily converge. There's not really a
unique solution to Q star because you can
have different Nash equilibria that have different values.
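A tiny example makes the point that different Nash equilibria can carry different values. The coordination game below is our own illustration, not one from the lecture: both (0, 0) and (1, 1) are pure Nash equilibria, but they give the two players different payoffs.

```python
# Payoffs for a 2x2 coordination game, indexed [row action][column action].
R = [[2, 0], [0, 1]]   # row player's payoffs
C = [[1, 0], [0, 2]]   # column player's payoffs

def is_pure_nash(i, j):
    """(i, j) is a pure Nash equilibrium if neither player can
    gain by deviating unilaterally."""
    row_ok = all(R[i][j] >= R[k][j] for k in range(2))
    col_ok = all(C[i][j] >= C[i][k] for k in range(2))
    return row_ok and col_ok

equilibria = [(i, j) for i in range(2) for j in range(2) if is_pure_nash(i, j)]
# Both coordination outcomes are equilibria, yet they have
# different values: (2, 1) at (0, 0) versus (1, 2) at (1, 1).
```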
- So there isn't really much hope of converging to
the answer, because there isn't the answer. The policies
cannot be computed independently, right, so a Nash equilibrium is
really defined as a joint behavior, and so we can't
just have two different players computing Q values. Even if we could compute the
Q values. It wouldn't necessarily tell us what to do with the policies, because
if you take two different policies that are both half of a Nash equilibrium,
two halves of a Nash equilibrium do
not necessarily make a whole Nash equilibrium.
- Because they could be incompatible. So, you know, so far so good, right?
- Yeah, I can't wait to see what happens next.
- The update is not efficient unless
P equals PPAD, which is to say, computing a Nash
equilibrium is not a polynomial time operation as far as
we know. It is as hard as any problem in
a class that's known as PPAD. And this is actually a
relatively recent result, in the, in the last five, ten
years. And this class is believed to be as hard
as, as NP. So, possibly harder. So it doesn't really,
doesn't really give us any leverage to, computational leverage to kind of
break it down in this way. So that's unfortunate. And
finally, the last little hope of, well, maybe we can define
this kind of learning scenario using Q functions the same way
we've been doing, Q functions are not sufficient to specify the
policy. That is to say, even if I could do all
these other things, efficiently compute a solution of, you know, build
the Q values, make them so that they're compatible with each
other. And now I just tell you, here's your Q function.
Now decide how to behave, you can't. It's, there's not enough information.
- You're depressing me, Michael.
- Yes, so this is kind of sad. We go to the general sum
case, which in some sense is the only case that matters, because zero sum
never really happens. And what we discover is that we seemingly lose
all of the leverage that we have in the context of Q-type algorithms.
- Mm, mm, mm.
- And that's where we'll stop.
- Oh. So we're going to end on a high note.
- No, maybe we should say something before we depart.
- Let's do that. Come up with something positive to say.
Lots of Ideas
So even though things are kind of grim, with regard to
solving the general sum games. There are lots of ideas that,
that have proven themselves to be pretty useful for addressing this,
this class of games. It is not the case that any one
of them has, has emerged as the dominant view, but, but
these are all really cool ideas. So here's one. You can think
about stochastic games as themselves
being repeated. So, repeated stochastic games.
We're going to play a stochastic game and when it's over, we're going to
play it again. And that allows us to build
folk theorem-like ideas at the level of stochastic games.
- Oh, that's cool.
- And so there are some efficient
algorithms for dealing with that. So that's
one idea. Another one is to make
use of a little bit of communication side-channel
to be able to say, hey, other player. Here's this thing that I'm thinking about.
And it's cheap talk in the sense that nothing being said is binding
in any way, but it gives the two players
the ability to coordinate a little bit.
And you can actually ultimately compute a correlated equilibrium,
which is a, a version of a Nash equilibrium
that you know, requires just a, a little bit
of coordination, but can be much more efficient to
compute. And you can actually get near-optimal
approximations of the solutions to stochastic games using that idea.
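To ground what a correlated equilibrium is, here is a small sketch of our own (the game and the numbers are an illustration, not from the lecture): in the game of chicken, a shared signal that recommends one of three joint actions uniformly is a correlated equilibrium, because neither player gains by deviating from their recommendation.

```python
# Game of chicken: action 0 = dare, action 1 = chicken.
# Payoffs indexed [row action][column action].
R = [[0, 7], [2, 6]]   # row player's payoffs
C = [[0, 2], [7, 6]]   # column player's payoffs

# A shared signal recommends one of these joint actions, each w.p. 1/3.
dist = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}

def obeys(payoff, dist, player):
    """True if, for every recommendation, following it is a best response
    to the conditional distribution over the other player's action."""
    for rec in (0, 1):
        cond = {a: p for a, p in dist.items() if a[player] == rec}
        total = sum(cond.values())
        if total == 0:
            continue
        def expected(action):
            # expected payoff of playing `action` given this recommendation
            return sum((p / total) *
                       (payoff[action][a[1]] if player == 0 else payoff[a[0]][action])
                       for a, p in cond.items())
        if any(expected(dev) > expected(rec) + 1e-9 for dev in (0, 1)):
            return False
    return True

is_correlated_eq = obeys(R, dist, 0) and obeys(C, dist, 1)
```

The signal here plays the role of a traffic light: it needs only a little coordination, and checking the equilibrium condition is just a set of linear inequalities, which is why correlated equilibria can be much cheaper to compute than Nash equilibria.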
- Yeah, that's
cool. Didn't, didn't I do some work in this space?
- You did. That's where I got the idea from.
- Oh okay.
- There's some, some work by Amy Greenwald looking
at how correlated equilibria play into stochastic games and
then your, your student Liam and you developed a,
a really cool algorithm that actually provably approximates the solutions.
- Another idea that I've heard a lot about lately, that I
really like, is the notion of a cognitive hierarchy. The idea that what
you're going to do is instead of trying to solve
for an equilibrium, you think about each player as assuming
that the other players have somewhat more limited computational
resources than they do. And then taking a best response
to what they believe the other players are going
to do. This turns out to be a really good
model of how people actually play when, when you
ask them to do games like this in the laboratory.
- Yeah, the good news about this idea is
that, because they're best responses, they can
be more easily computed. It's more
like Q-learning in MDPs again, because you're
assuming that the other player is fixed.
- Okay. I'll buy that.
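A minimal sketch of that cognitive-hierarchy idea, under the simplifying assumptions (ours, not the lecture's) that level-0 players play a fixed default pure action and each level-k player best-responds to a level-(k-1) opponent:

```python
def best_response(payoff, opp_action):
    """Action maximizing payoff[a][opp_action]; payoffs are indexed
    [own action][opponent action]."""
    return max(range(len(payoff)), key=lambda a: payoff[a][opp_action])

def level_k_action(payoff_self, payoff_other, k, level0=0):
    """Level-k reasoning: a level-0 player plays a fixed default action;
    a level-k player best-responds to a level-(k-1) opponent."""
    if k == 0:
        return level0
    opp = level_k_action(payoff_other, payoff_self, k - 1, level0)
    return best_response(payoff_self, opp)

# Matching pennies, indexed [own action][opponent action]:
# the row player wants to match, the column player wants to mismatch.
ROW = [[1, -1], [-1, 1]]
COL = [[-1, 1], [1, -1]]
```

Because each step is just a best response to a fixed opponent, every level is an ordinary single-agent optimization, which is what makes this so much easier than equilibrium computation.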
- And, the last idea I want to throw
out is the notion of actually using side
payments, so that the players, as they're playing
together, can not only take joint actions, but they can
say, hey, I'm going to get a
lot of reward if we take this action, and I'm going to
give some of that reward back to you, and that
will maybe encourage you to take the action that
I need you to take so that we'll both
do better. And so there's this lovely theory by
a father and son duo that they call coco values.
Coco sounds awesome, but it stands for cooperative-competitive
values [CROSSTALK], and so it actually balances the zero-sum
aspect of games with the mutual-benefit aspect
of games. So it's, it's, it's a really elegant idea.
- So basically, the problem
isn't solved but there are a lot of cool
ideas that are getting us close to solving it.
- That's right. Yeah. So even though the one player and
the zero sum cases are pretty well understood at this point, the
general sum case is not as well understood. But there's a
lot of really creative ways that people are trying to address it.
- So, that is good news.
What Have We Learned
Okay, Charles, what have we learned? And I mean
specifically in the context of this Game Theory Two lesson.
- That's a, that's a good question. We learned
about the iterated prisoner's dilemma, which turns out to be cool,
and it solves the problem, and we learned about
how we can connect the iterated prisoner's dilemma to reinforcement learning.
- What do you mean?
- Through the discount.
- Yeah, so I think of that as being the
idea of repeated games.
- Right. Let's see, what else have we learned? So we learned about
the iterated prisoner's dilemma, which allowed us to get past this really scary thing
with repeated games, connected it with
reinforcement learning, the discounting. And then we
learned other things like, for example, I
don't remember. What, what did we learn?
- Well so the, the connection between iterated prisoner's dilemma and
repeated games was the idea that we can actually encourage cooperation.
And in fact, there's a whole bunch of new Nash equilibria that appear
when you work in repeated games. That was the concept of the Folk Theorem.
- Right. The Folk Theorem. So, the Folk Theorem
is really cool. And this whole notion of repeated
games really seems like a clever way of getting
out of what appear to be limitations in game theory.
- Right. Yeah. And in particular by using things like threats.
- Right. But only plausible threats.
- Right, so that
was the next thing we talked about. The idea
that an equilibrium could be subgame perfect or not,
and if it wasn't then the threats could be
implausible. But in the subgame perfect setting, they're more plausible.
- Right, let's see, and then we learned about minimax-Q.
- Well there was one last thing we did
on the repeated games, which was the Computational Folk Theorem.
- Yes you're right. So basically what we learned
is that Michael Littman does cool stuff in game theory.
- Or at least he does
stuff that he's willing to talk about in a MOOC.
- Yes, so that's, there's actually a
technical term for that, right? MOOC acceptable research?
- Oh, I didn't know that.
- So all these things, by virtue of the
fact that they showed up in this class, are MOOC acceptable.
- Alright, you're right, but then we switch to stochastic games.
- And they generalize MDPs and repeated games.
- Anything else?
- Well, that particularly got us to minimax-Q
and then eventually to Nash-Q. But
despite the fact that Nash-Q doesn't work, we ended up in a place of hope.
- [LAUGH] We end with some hopefulness.
- Yeah, and you know, I think that
that's actually a lesson for the entire course.
That at the end of the day, sometimes
it doesn't always work, but there is always hope.
- [LAUGH] We don't give up and that's, that's, that's how research works.
Even when we have impossibility results for things like clustering, or
multi-agent learning and decision
making, we still keep struggling forward.
- And keep learning, and isn't that what's really important? I think so.
- It's important for us, and it's important for machines.
- Yes, that is beautiful. I feel like we've made
it to a good place, Michael. Perhaps we should stop.
- [LAUGH] Well it has been, it has been
delightful getting to talk to everyone, and it has been
very fun getting to talk with you, Charles.
And thanks to everybody for making this happen.
- I agree. And we have one more chance to talk with one another as we
wrap up the class. And I look forward to that. So, I will see you then, Michael.
- Awesome. Do we get to see each other in person for that?
- We get to see each other in person for that. That will be fun.
- Okay, well, bye, Michael. I'll see you next time.
- Bye. See ya.
- Bye, bye.