Ensemble Learning Boosting
- Hey Charles, how's it goin'?
- It's going pretty well Michael. How's it going with you?
- Good, thank you.
- Good, good, good. Guess what we're going to talk about today?
- Well, reading off this screen, it looks
like maybe ensemble learning, and boosting, whatever that is.
- Yes, that's exactly what we're
going to talk about. We're going to talk about
a class of algorithms called ensemble learners. And I think you will See
that they're related to some of the stuff that we've been doing already,
and in particular we're going to spend most of our time focusing on, boosting.
because boosting is my favorite of the
ensemble learning algorithms. So you ready for that?
- Yeah! Let's do it.
- Okay. So, I want to start this out by,
going through a little exercise with you. I want you
to think about a problem. Okay. And the particular
problem I want you to think about is, spam email.
- Mm, I think about that a lot.
- So, normally if we were thinking of this a classification
task, right, where we're going to take some email and we're going to decide
if it's spam or not. Given what we've talked about so
far we would be thinking about using a decision tree or
you know neural networks or k and n whatever that means
with email. We would be coming up with all of these sort
of complicated things. I want to propose an alternative which is
going to get us to ensemble learn. OK and here's the alternative.
I don't want you to try to think of some complicated
Rule that you might come up with that would capture spam email.
Instead, I want you to come up with some simple rules
that are indicative of spam email. Okay, so let me be
specific, Michael. We have this problem with spam email. That is,
you you're going to get some email message and you want some computer
program To figure out automatically for you if something is a
piece of spam or it isn't. And I want you to help
write a set of rules that'll help me to figure that
out. And I want you to think about simple rules, so can
you think of any simple rules that might indicate that something is spam?
- Alright I can, yeah I can think of some simple rules. I don't think
they would be very good, but they might better than nothing. Like If for example
it mentions how manly I am, I, I would be, be willing to believe that
was a spam message. So like if the body of the message contains the word manly.
- Okay, I like that. like that when body
I like that rule, because I often get non-spam messages talking about
manly. So I guess one man's spam is another man's normal email.
- [LAUGH] I guess that's true.
- Probably. Any other rules?
- Sure if it, you know if it comes from my spouse it's probably not spam.
- OK, so let's see, from spouse.
- Her name's Lisa. Now we're going to call our spouse. So
let's say minus, I'm going to go to plus here, so we know
some rules are indicitive of being spam, and some
rules are indicitive of not being spam. Okay, anything else?
- Possibly the length of the message. I guess. Like what?
- I don't know. I don't know that this would be very
accurate, but I think some of this, some of the spam I
get sometimes is very, very short just like the, it's like the
URL. Like hey, check out this site, and then there's a URL.
- Hm, I like that. So, we'll just say short.
Just contains URLs. Hm, I like those rules. Let's see if we can
think of anything else. Oh, how about this one. It's just an image.
- I get a lot of those where it's just an image.
- I see, like in it's it's and if you
look at the picture it's all various pharmaceuticals from Canada.
- Exactly. Here's one I get a lot.
- Lots of misspelled words that you end up reading as being
a real word.
- Hm. But I don't know how I'd write that
as a rule. Or you could just list the words.
- Like rules that, words that have already
been modified in that way. I guess so.
- Yeah, kind of a, kind of a black list, a black list of words.
- Okay so, words like, I would say manly, but you were saying prawn.
- Or whatever that says. yeah, so they're and they're tons
of these. Right? I mean, another one that's, that's very popular
if you're old enough anyway is this one, remember this one?
- Oh, sure that was sometimes a virus, right?
- Yes. Our young, our younger viewers will not know this but this was one
of the first big spam messages that would
get out there. Make money fast. And there's
tons and tons of these. We could come up with a bunch of them. Now, here's
something they all kind of have in common,
Michael, and you've touched on this all ready.
All of them are sort of right. They're useful but no one of them is going to be
very good at telling us whether a message has
spam on its own. Right. So the word manly is
evidence but it's not enough to decide whether something
is spam or not. It's from your spouse, it's evidence
it's not spam, but sometimes you get messages from
your spouse that are in fact spam, because in fact,
she didn't actually send them. You know, and so
on and so forth. And sometimes she did email
from Princes in Nigeria. I didn't. And they're not
always spam. I, I actually do, but any case,
sometimes people are asking you for money, and maybe
that's message you want to ignore, but it isn't
necessarily spam. And some people are very interested in
getting like this and don't consider it spam, right?
- So, so, okay, so I can see that these would
all maybe provide some evidence, but it seems really hard tp figure out the
right way of combining them all together to I don't know, make a decision.
- Right, this is exactly right. And by the
way, if you think about something like decision tree, you
could. There's really a sort of similar problem going on
there. We can think of each of the nodes in
a decision tree as being a very simple rule and
the decision tree tells us how to combine them. Right?
So, we need to figure out how to do that
here and that is the fundamental notion of ensemble learning.
- But wait, isn't, couldn't you also do something similar with
something like neural net. Right? Where each of these now becomes a
feature and we're just trying to learn ways for combining them
all together. So That would kind of satisfy what you were talking about.
- True, I mean I think the the difference here in this case and
and I think you're absolutely right but
one difference here is that typically with the
new network we've already built the network
itself and the nodes and we're trying
to learn the weights whereas in something
like a decision tree you're building up rules
as you go along. And typically with ensemble learning you're building up a
bunch of rules and combining them together until you got something that's good
enough. But you're absolutely right. You
could think of [UNKNOWN] networks as being
an ensemble of little parts. Sometimes
hard to understand, but an ensemble nonetheless.
Ensemble Learning Simple Rules
- So, the characteristic of ensemble learning is first this, that
you take a bunch of simple rules, all of which kind
of make sense and you can see as sort of helping.
But on their own, individually, do not give you a good
answer. And then you magically combine them in some way to
create a more complex rule, that in fact, works really well.
And ensemble learning algorithms have a sort of basic form to
them, that can be described in just one or two lines.
So let me do that and then we can start wondering a little bit
how we're going to make that real. So here's the basic form of an ensemble
learning algorithm. Basically you learn over a
subset of the data, and that generates
some kind of a rule. And then you learn over another subset of the data
and that generates a different rule. And then you learn over another subset of
the data and that generates yet a third rule, and yet a fourth rule, and
yet a fifth rule, and so on and so forth. And then eventually you
take all of those rules and you combine them into one of these complex rules.
So, we might imagine in the email case that I might
look at a small subset of email that I know is already
spam and discover that the word manly shows up in all of
them and therefore pick that up as a rule. That's going to be
good at that subset of mail, but not necessarily be good
at the other subset of mail. And do the same thing and
discover that a lot of the spam mails are in fact short
or a lot of them are just images or just urls and
so on and so forth. And that's how I
learn these rules by looking at different subsets. Which
is why you end up with rules that are very good at a small set of the data, but
aren't necessarily good at a large set of the
data. And then after you've collected these rules, you combine
them in some way, and there you go. And
that's really the beginning and the end of ensemble learning.
- So wait. So, when you say manly was in a lot of the positive
examples. Do you mean like it distinguishes the
positive and the negative example? So it should
also not be in the negative examples.
- That's right. That's exactly right. So think of
this as any other classification learning problem that you
would have where you're trying to come up with
some way to distinguish between the positives and the negatives.
- And, and now why does it, why are we are looking at subsets of
the data? I don't understand why we can't just look at all, the whole data.
- Well, if we look at all of the data,
then it's going to be hard to come up with these, these
simple rules. That's the basic answer. Actually, ask me that
question a little bit later, when we talk about over fitting,
and I think I'll have a good answer for you. Okay,
so here we go Michael. This is Ensemble Learning. You learn
over a subset of the data over and over again picking
up new rules and then you combine them and you're done.
Ensemble Learning Algorithm
- Here's the Ensemble Learning algorithm. We're done,
Michael, we're done with the entire lesson.
We don't have to do anything else anymore. We know that we're supposed to
look over subset of data, pick up rules, and then combine them. So, what
else do you need to know in
order to write your first Ensemble Learning algorithm?
- So, I'm already kind of uncomfortable with this notion of"
combine," right? So, like, I can think of lots of really dumb
ways to combine things. Like, choose one at random or, you
know, I don't know, add em all up and divide by p
,i I mean so,so presumably there's gotta be
some intelligence in how this combination is taking place
- Yes, you would think so, but your not
at all bothered about how you pick a subset?
- Oh ,I was imaging you meant random subsets.
- Oh ,so you'd automated assumption about how we were going to
pick a subset. You just [CROSSTALK] werent sure how to combine them. Well
actually lets explore that for a minute. Here's kind of the dumbest
thing you can imagine doing. That turns out to work out pretty well.
We're going to pick subsets, by, I'm going to say uniformly. Just to
be specific about it. So ,we're going to do the dumbest thing
that we can think of, or one one of the dumbest
things you could think of. Or maybe ,we should say simplest and
not dumbest so as not to, to, to make a value
judgment. That you can think of doing which would be to just
uniformly randomly Choose among some of the data, and say that's
the data I'm going to look at, and then I'm going to apply some
learning algorithm to it. Is that what you were thinking of Micheal?
- Okay, so just pick a subset of the
data, apply a learner to it. I'll get some
hypothesis out, I'll get some rule out. And now
I'm going to combine them, so since were being simple. Why
don't we try doing something simple for combining. Let's
imagine, Michael, that we're doing a regression. What's kind of
the simplest thing you could do if you have
ten different rules which tell you, how you should be
predicting some new data point? What's the
simplest thing you could imagine doing with it?
- So, okay, so each of them spits out a number. I guess if we kind
of equally believe in each of them, a
reasonable thing to do would be to average.
- Great. So, a very simple way of combining, in the case of
regression, would be to average them. We'll simply take the mean. And by
the way, why wouldn't we equally believe in each of them. Each one
of them learned over a random subset of the data. You have no reason.
- To believe
one's better than the other.
- There's a couple of reasons. One ,it could be a
bad random subset. I don't know how I would measure that.
- I could be a good random subset.
- Yeah. Then we'd want, we'd want that to count more in
the mean. But, but I guess what I was thinking more in terms
of maybe for some of the subsets you know, it gets more
error than others or it uses more complex rule than others or something.
- I could imagine that. Actually maybe we can explore how this sort of idea
might go wrong. Let's, let's do that. Maybe we can do that with the quiz.
You like quizzes, right?
- They're important.
Ensemble Learning Outputs
- Okay, so here's a quiz for you Micheal, here's the
setup, you ready? You've got in data points. The learner
that you're going to use over your subsets is a
zero or a polynomial. The way you're going to combine the
output of the learners is by averaging them. So, it's
just what we've been talking about so far, and your
subsets are going to be constructed in the following way. You
uniformly randomly picked them and you ended up with N different
subsets or disjoint, and each one has a single point
in it, that happens to be one of the data points.
- Okay i think i get that
- Right, so if you look over here on your left
you got a graph of some data points and this is
one subset This is another subset, that's another subset, that's another
subset, that's another subset, that's another
subset, that's another subset, got it.
- Yeah, now what do you want to know about it.
- Now what I want to know is when you do your
ensemble learning you learn all these different,
you learn all these different rules and then
you combine them ,what is the output going to be? What does the ensemble output?
- And you want a number?
- I want a description and if the answers a number that's a
perfectly fine description. But I'll give you a hint, it's a short description.
- A short description of the answer. Okay, I'll think about it.
Ensemble Learning Outputs
- Okay Michael have you thought about it? Do you know what the answer is?
- Yeah. I think, you know, you asked it in a
funny way, but I think, what you're asking maybe was pretty
simple. So let, let me, let me see if I can
talk it through. So ,we've got n data points, each learner is
a zeroth order polynomial. So you, you said the ensemble rule
is that you learn over a subset, a zeroth order polynomial
is just (no period) Well, we said that the thing that
minimizes the average. Sorry, that minimizes the expected error, or the squared
error [INAUDIBLE] it's just the average. So, if
the sets are indistinct sets, with one data point
each, then each, of the individual learners, is
just going to learn the average. Then they get, not
the average sorry. The actual output value of
each individual point is the average, and then the
combining algorithm, to combine all the pieces of
the ensemble into one answer, combines with the mean.
So ,it's going to combine the mean of those
each of which is the data point, so it's
the mean of the data points. So, the ensemble
outputs, I don't know, I'd say average or mean?
- Or zeroth [INAUDIBLE] polynomial of the data set,
or, you know, one node decision tree, or ,uh.
- A constant? Which happens to be the
mean of the data. Haven't we seen this before?
- It seems to come up a lot,
when we are outputting very simple hypotheses.
- Right. And the last time we did this, if I recall correctly, this is what
happens ,if you do an unweighted average with k and n where k is equal to n.
- Oh, right. Like, like, right. And N-NN.
- Mm, so we should probably do
something a little smarter than this then. And,
I thought that we might look at some of the housing data, because, no
one's started looking at the housing data
yet. [LAUGH] Okay, so let's look at that
right quick and see if we can figure out how this works. And then
see if we can do something a little bit better, even better than that. Okay?
Ensemble Learning An Example
- Alright, Michael, so, here's what you have before you. You have the same
housing data that we've looked at a couple of times before. I've, for the
sake of readability, I've drawn over, some of the data points so that
they're easier to see. This is exactly the data, that we've always had. Okay?
- Now, you'll notice that I marked one of
them as green, because here's what we're going to do.
I'm going to take the housing data you've got, I'm
going to do some ensemble learning on it. And I'm
going to hold out the green data point. Okay? So of
the nine data points, you're only going to learn on 8
of them. And I'm going to add that green data point
as my test example and see how it does. Okay?
- Okay. So that sounds like, cross validation.
- It does. This is a cross validation. Or you could just say,
I just put my training set and my test set on the same slide.
- Okay, Michael, so the first thing I'm going to do is I'm going to pick
a random subset of these points. And just for the
sake of the example, I'm going to pick five points
randomly. And I'm going to do that five times. So
I'm going to have five subsets of five examples. And by the
way, I'm going to choose those randomly, and I'm going to choose
them with replacement. So we're not going to end up in the
situation we ended up just a couple of minutes ago
where we never go to see the same data point twice.
- Alright. So 5 subsets of 5 examples, and then I'm
going to learn a third order polynomial. And I'm going to take those
I'm going to combine them by averaging. Want to see what we get?
- Oh, yeah, sure.
- So here's what you get, Michael. Here I'm showing you a plot over
those same points, with the five different
- Yeah. They're, right. There's like a bunch of wispy hairs.
- Just like most third order polynomials. And as
you can see they're, they're kind of you stare at
them and you see their kind of similar. But
some of them they veer off a little bit because
they're looking at different data points. One of them
actually very hard to see because it's only one like
this. Actually veers off like this because just, purely
randomly, it never got to see the two final points.
- I see.
But they all, but they all seem to be pretty much in
agreement, like between points three and
four. There's a lot of consistency there.
- Right. Because just picking five of the
subsets you seem to be able to either get
things on the end, or you get things in
the middle. And maybe one or two things on
the end it sort of works out. Even the
one that doesn't see the, the last two points
still got to see a bunch of first ones
and get this part of this space fairly right.
So the question now becomes how good is
the average of these compared to something we might
have learned over the entire data set? And here's
what we get when do that. So what you're
looking at now Michael, is the red line,
is the average of all of those five third
order polynomials. And the blue line, is the fourth
order polynomial that we learned when we did this
with simple regression, a couple of lessons back.
- And you actually see them pretty close.
- Why is one of them a fourth order, and one a third order?
- Well what I wanted to do is try a simpler set of hypothesis,
than we were doing, than when we were doing full blown regression. So third
order simpler than fourth order. So, I thought we'd combine a bunch of simpler
rules. Then the one we had used before and see how well it does.
- You want to know how well it does?
- I would!
- Well it turns out that on this data
set and I did this many, many, many times
just to see what would happen with many different
random subsets. It typically is the case that the
blue line always does better on the training set,
the red points, than the red line does. But
the red line almost always does better on the
green point on the test set or the validation set.
- That is kind of interesting. So wait, so let me get this straight.
It seems sort of magical. So, so it learns an average
of third degree polynomials, third order polynomials, which is itself a
third order polynomial. But you're saying it does better by doing
this kind of trick than just learning a third order polynomial directly.
- Yeah, why might you think that might be?
I have a guess, you tell me what you think.
- Wow, so well, I mean, you know, the danger is often
over fitting, over fitting is like the scary possibility. And so maybe
by, by kind of mixing the data up in this way and focusing
on different subsets of it. I don't know. Somehow manages to find the
important structure as opposed to getting
misled by any of the individual datapoints.
- Yeah. That's the basic idea. It's kind of the
same thing, at least that's what I, I think that's
a good answer. It's basically the same kind of argument
you make for cross validation. Alright. You take a random
bunch of subsets. You don't get trapped by one or two points that happen to
be wrong because they happen to be wrong
because of noise or whatever. And you sort
of average out all of the variances and
the differences. Hm. And often times it works.
And in practice this particular technique. Ensemble learning
does quite well in getting rid of overfitting.
- And what is this called?
- So, this particular version, where you take
a random subset and you combine by the mean, it's called bagging.
- And I guess the bags are the random subsets?
That's how I'm going to think of it.
- That's how I'm going to think of it. It also has another name
which is called bootstrap aggregation. So I
guess the different subsets are the boots.
- [LAUGH] No,no, no, no bootstrap usually
refers to pulling yourself up by your bootstraps.
- Yeah, I
like my, I like my answer better. So, each of the subsets
are the boots and the averaging is the strap. And there you go.
So, regardless of whether you call it bootstrap aggregation or you call it
bagging, you'll notice it's not what I said we were going to talk about
during today's discussion. I said we were going to talk about boosting. So we're
talking about bagging but we're going to
talk about boosting. The reason I wanted
to talk about bagging is because it's really the simplest thing you can
think of and it actually works remarkably well. But there are a couple
of things that are wrong with it, or a couple
of things you might imagine you might do better. That
might address some of the issues and we're going to see
all of those when we talk about boosting right now.
- Okay so, let's go back and look at our two
questions we were trying to answer. And so far we've
answer the first one, learn over a subset of data,
defined a rule by choosing that subset uniformally randomly and
applying some learning algorithm. And we answered the second question,
which is how do you combine all of those rules
of thumbs by saying, you simply average them. And that
gave us, bagging. So Michael, I'm going to suggest an alternative
to at least the first one. And leave open the
second one for a moment. That's going to get us
to what we're supposed to be talking about today, which
is boosting. So let me throw and idea at you and
you tell me if you think it's a good one.
So rather than choosing uniformly randomly. Over the data, we should
try to take advantage of what we are learning as
we go along, and instead of focusing just kind of randomly,
we should pick the examples that we are not good at.
So what do i mean by that? What i mean by that
is. We should pick a subset based upon whether the examples
in that subset are hard. So what do you think of that?
- Well, I guess it depends on how we think about hard, right
so it could be that it's hard because some, in some absolute sense right,
or could be that it is hard relative to you know, if we
were to stop now how well we do Yeah, and I mean the latter.
- Oh. Okay. Alright. Well that I feel like
that makes a lot of sense. I mean, certainly when
I'm you know, trying to learn a new skill,
I'll spend most of my energy on the stuff that
I kind, that I'm kind of on the edge of
being able to do, not the stuff that I've already
mastered. It can be a little dispiriting. But it, but
it I think it, I make faster progress that way.
- Right and if you, if you go
back to the example that we, we started with, with spam right?
If you come up with a and you see it does a
very good on some of the data, some of the mail examples,
but doesn't do a good job on the other. Why would you spend
your time trying to come up with more rules that do well
on the email messages you already know how to classify? You should
be focusing on the ones you don't know how to classify. And
that's the basic idea here between, the basic idea hearer behind boosting and
finding the hardest examples.
- Okay. So that answers the first one. We're going to
look at the hardest examples, and I'm going to define for
you exactly what that means. I'm going to have to introduce
at least one technical definition. But ,uh, I want to make
certain you got that. And the second one, the combining,
well that's a difficult and sort of complicated thing, but
at a high level, I can explain it pretty easily
by saying we are going to still stick with the mean.
- We're voting except this time,this time we
are going to do weighted mean. Now why do we want to do
weighted mean? Well I have to tell you exactly how we are going to
weight it but the basic idea is to avoid the certain situations That
we came across when we looked at the data before, when taking an average
over a bunch of points that are spread out, this gives you an
average or a constant that doesn't give you a lot of information about the
space. So we're going to weight it by something, and it's going to turn
out the way we choose to weight it will be very important. But just
keep in your head for now that we're going to try to
do some sort of weighted average. Some kind of weighted voting. Okay?
- Sure. One of the things that's scaring me at the moment though is this,
like I have this fear that by focusing
on the hardest questions, and then, and then
sort of mastering those, what's to keep the
learner from starting to kind of lose track of
the ones it has already mastered? Like how,
why does it not thrash back and forth?
- So that's going to be the trick. Behind the,
the particular way that we do weighting.
- So I will show you that in a moment, and it's going to require
two slightly technical definitions, that we have
been kind of skirting around, this entire conversation. Okay?
Ensemble Boosting Quiz
- Alright so, the, the whole goal of what we're
going to add for boosting here is we're going to, we're going to
expand on this notion of hardest examples and weighted
mean. But before I can do that, I'm going to have
to define a couple of terms. Okay. And you
let me know Michael if, if these terms make sense.
So, here's the first one. The first one is
error. So how have we been defining error so far?
- Usually we take the square difference between the
correct labels and the, what's produced
by our classifier or regression algorithm.
- That's true. That is how we've been using
error when we're thinking about regression error. How about, a
notion of accuracy. About how good we are at,
say, classifying examples. So let's, let's stick with classification formulas.
- Well, that would be the same as
squared areas, except that it's not really meeting the
whole powers [INAUDIBLE] area. That is to say, if
the outputs are zeroes and ones, the squared area
is just whether or not there's a mismatch. So
it could just be the total number of wrong answers.
- Right. So, what we've been doing so
far is counting mismatches. I like that word, mismatches.
And we might call an error raid or
an error percentage as the total number of mismatches
over the total number of examples. And that
tells us whether we're at 85% or 92%, or,
or whatever, right? So that's what we've been doing
so far. But implicit in that, Michael, is the
idea that every single example is equally as important.
So, that's not always the case. Now you might remember
from the first talk that we had. We talked
about distributions over examples. We said that, you know, learning
only happens if you're training set has the same
distribution as your future testing set. And if it doesn't,
then all bets are off. And it's very difficult
to talk about induction or learning. That notion of distribution
is implicit in everything that we've been doing so far, and
we haven't really been taking into account when we've been talking
about error. So here's another definition of error and you tell
me if you think it makes sense, given what we just
said. So, this is my definition of error. So the subscript
D, stands for distribution. So we don't know how new examples
are being drawn, but however they're being drawn they're being drawn
from some distribution, and I'm just going to call that distribution" D", okay?
Right. So H is our old friend the
hypothesis. That's the specific hypothesis that our learner
has output. That's what we think is the
true concept, and C is whatever the true underlying
concept is. So I'm going to define error as
the probability, given the underlined distribution that I
will disagree with the true concept on some
particular instance X. Does that make sense for you?
but I'm not seeing why that's different from number of mismatches in the
sense that if we count mismatches on a sample drawn from D, which is
how we would get our testing set anyway. Then I would think that would
be you know if it's large enough a pretty good approximation of this value.
- So here Michael, let me give you a specific example.
I'm going to draw four, four possible values of X. And when
I say I'm going to draw four possible values of X, I
mean I'm just going to put four dots on the the screen.
- Okay? And
then I'm going to tell you this particular learner output a hypothesis.
Output you know, a, a potential function that ends up getting
the first one and the third one right, but gets the
second and the fourth one wrong. So what's the error here?
- So let's just make sure that, that
everybody's with us. Let's do this as a quiz.
- Okay, so let's ask the students what they
think. So here's the question again. You've output some hypothesis
over the four possible values of x, and it turns
out that you get the first and the third one right,
and you get the second and the fourth one wrong.
If I look at it like this, what's the error rate?
Ensemble Boosting Solution
- Okay, Michael what's your answer?
- It looks like, half of them are right and half of them are
wrong. So, the number of mismatches, is, two out of four or a half.
- Right, that is exactly the right
answer ,because ,you got half of them right
and half of them wrong. But ,it assumes ,that your likely to see all four of
the examples, equally often. So, what if I told you, that, that's not in fact
the case. So ,here's another example of error for you. What if I told you that
each of the points, is, likely to be seen ,uh,
in different proportions and in fact in these particular proportions.
So you're going to see the first one ,half the time.
You're going to see the second one ,one 20th at
a time. You're also going to see the fourth one, one
,four tenths of the time. Alright, so you got it
Michael? One half, one 20th, four tenths and one twentieth.
- Got it.
Ensemble Boosting Quiz Two
- Okay. So, now I have a different question for you. Actually, I have
the same question for you, which is, what is the error rate now. Go.
Ensemble Boosting Solution Two
- Okay, Michael what's the answer?
- Well, it's still a half. But I guess we, we really should
take into consideration those probability. So
the number of mismatches they have, but
the actual number of errors, the expected number of errors is like well,
a 20th plus 20th, so like a 10th. So it's 90% correct, 10% error.
- Right. That's exactly right, so, what's important to see here
is that even though you may get many examples wrong, in some
sense some examples are more important than others. Because
some are very rare. And if you think of error,
or the sort of mistakes that you're making, not as
the number of distinct mistakes you can make, but rather
the amount of time you will be wrong, or the
amount of time you'll make a mistake, then you can
begin to see that it's important to think about the
underlying distribution of examples that. You see. You buy that?
- Okay, so, that notion of error
turns out to be very important for boositng because in the
end, boosting is going to use this trick of distributions in order
to define what hardest is. Since we are going to have learning
algorithms that do a pretty god job of learning on a bunch
of examples. We're going to pass along to them a distribution
over the examples, which is another way of saying, which examples are
important to learn, versus which examples are not as important to
learn. And that's where the hardest notion is going to come in.
So, every time we see a bunch of examples, we're
going to try to make the harder ones more important to get
right. Than the ones that we already know how to solve.
And I'll, I'll describe in a minute exactly how that's done.
- But isn't it the case that this distribution
doesn't really matter? You should just get them all right.
- Sure. But now it's a question of
how you're going to get them all right. Which
brings me to my second definition I want to
make. And that second definition is a weak learner.
So there's this idea of a learning algorithm, which is
what we mean by a learner here. As being weak. And
that definition's actually fairly straightforward
so straightforward in fact that you
can sort of forget that it's really important. And all a
weak learners is, is a learner that no matter what
the distribution is over your data, will do better than chance
when it tries to learn labels on that data. So what
does does better than chance actually mean? Well what it means
is, that no matter what the distribution over the
data is, you're always going to have an error rate that's
less than a half. So that means sort of
as a formalism, is written down here below. That for
all D, that is to say no matter what
the distribution is, your learning algorithm We'll have an expected
error. That is the probability that it will disagree with
it through actual concept if you draw a single sample
that is less than or equal to one half
minus Epsilon. Now epsilon is a term that you end
up seeing a lot in mathematical proofs and particularly ones
involving machine learning. And epsilon just means a really, really
small number somewhere between a little bigger than 0 and
certainly much smaller than 1. So, here what this means
technically is that you're bounded away from 1 1/2. Which
another way of thinking about that is you always get
some information from the learner. The
learner's always able to learn something. Chance
would be the case where your probability is 1/2 and you actually learn
nothing at all which kind of ties us back into the notion of information
gained way back when with decision trees. So does that all make sense Michael?
- I'm not sure that I get this right. Let's, maybe we can do
a quiz and just kind of nail down some of the questions that I've got.
- Okay, sure. You got an idea for a quiz?
- Okay Michael so you, let's make certain that
you really grasp this concept of weak learning okay?
- So, here's a little quiz that I put together to test your knowledge.
So, here's the, here's the deal. I've got a little matrix here, it's a
little table, and across the top. Are
three different hypotheses. So, hypothesis one, hypothesis
two, and hypothesis three. So your entire
hypothesis base consists only of these three hypothesis,
hypotheses. Got it?
- Got it.
- Okay, your entire instance space consists entirely of
only four examples; X1, X2, X3, and X4. Got it?
- Got it.
- I have an X in a square, if that particular hypothesis does not get the
correct label for that particular instance, and I
have a green check mark if that particular
hypothesis does in fact get the right label for that
example. So, in this case hypothesis one gets examples 2, 3,
and 4 correct. But gets example one wrong, while hypothesis
three gets one in four correct, but two and three incorrect.
- I see. So, there's no hypothesis that gets everything right.
- So does that mean that we don't have a weak
learner, because then there's some distributions for which any given hypothesis
is going to get things wrong.
- Maybe. Maybe not. Let's see. Here's what I want you to do. I want
you to come up with the distribution over
the 4 different examples, such that a learning
algorithm that has to choose between one of
those 3 hypotheses will in fact be able
to find that does better than chance. That
is, has an expected error greater than half.
- Okay. Then if you can do that, I want
you to see if you can find a distribution which might not exist,
such that if you have that distribution
over the four examples, a learning algorithm
that only looked at age one, age two and age three would not be
able to return one of them that has an expected error greater than half.
- So greater than half in this case would mean three out of four, correct?
Oh no, no. Oh, you're using, you want to
use that definition that you, that actually took
into consideration the distribution.
- Exactly. That's the whole point. If you, if you always need to
have some distribution over your examples to
really know what your expected error is.
- Alright. And if there is no such evil distribution, should I just fill in
zeroes in all those boxes? Yes, all zeros means no such distribution. You can do
it in either case. So if you put in all zeros you're saying no such
distribution exists. But otherwise it should add
up to one down each of the columns.
- It had better add up to one.
- Alright, I think I
can give that a shot. Okay, go.
- Okay Michael, you got answers for me?
- Yeah, I think so. The first thing I notice is that if I put
equal weights on all four examples, like
I, I decided that instead of solving this
problem by thinking, I would just try a couple examples, and see if I found
things in both boxes. So, if I put equal weight on X1, X2, X3, X4.
- Then hypothesis one, H1 gets three
out of four correct, that's three quarters.
That's better than a half. So.
- Well done.
- Then I fill that in, in the good boxes, quarters all the way down.
- That's a turtle, 'because it's turtles all the way down [LAUGH].
- No, no, it's not though, it should be quarters all
the way down. I thought you'd may be draw a quarter.
- I, I can't draw a quarter, also I can't draw a turtle obviously but still.
- [LAUGH] Agreed. Alright, good.
- You'd think, anyone, do you think anyone listening to
this is old enough to get turtles all the way down.
- Yeah, that's a great joke. Everybody knows that joke.
if people don't know the joke, then we should pause this thing right now, and
you should go look up turtles all the way down. And then come back. Okay.
- It's a, it's a really great joke if you're computer scientist.
- Yes, and if you don't think it's a good
joke then you should probably be in a different field. Okay.
- What about the evil distribution?
- So then I started to generate. Okay, well what
if, the, the, the issue here is that, because we spread
all the, the, the, the probability out in the first hypothesis
really good. So I said okay, well let me put all
the weight on the first example. The x1.
- Okay. So what did that look like.
- Now h1 did very badly. It gets, it's 100 percent error. And
h2 is 100 percent error. But h3 is 0 percent errors. So so.
- So, so putting it all putting all the weight on x1 is no
good. And if you look x2, x3, x4, they all have the property that there's
always a hypothesis that gets them right.
So I started to think, well maybe there
isn't an evil distribution. And I kind of lucked into putting a half on both
the first and the second one. because i figured that, that
ought to work, but then i realized, oh wait a second that's
an evil distribution, because if you choose h1, h2, or h3,
they all have exactly 50% error on the half a half distribution.
- Very good. So 1/2, 1/2, zero, zero, is a correct answer.
- Now I don't know if there's others. You know, certainly X, putting all
the weight on X2 and X3 is no good, because H2 and H1
both get those. Putting all the weight on X3 and X4 are no
good, because H1 gets all of those correct. In fact we have to
have some weight on X1, right. Otherwise H1 is the way to go.
- Right. So, yeah. No, that's interesting.
So what does that mean in this case?
- What do you mean, what does that mean?
- So what does this tell us about, how
do we build a weak learner for this example?
- So what it
tells us is that since there is a
distribution for which none of these hypotheses will
be better than chance, there is no weak
learner for this hypothesis space, on this instant set.
- Interesting. Now is there a way that
we can, like, okay, so this example has no
weak learner. Is there a way to change
this example so it would have a weak learner?
- Um...I'm sure there is.
- Like if we change that x2, x, h3, if that was a check instead of an X.
- Which one?
- So if we made that a green one, here I'll,
I'll make it a green one. By using the power of computers.
- Woah, special effect.
- So now there's no way to put weight
on any two things and have it fail. I don't
know, my intuition now is that this, this should have
a weak learner. Okay, well, how would we prove that?
- I don't know, but may be we should end this quiz.
- Yeah, I think, we should end this quiz. And leave it as an exercise
to the listener. I'm pretty sure I can figure this out.
By the way, we should point a couple of things here
though, Michael. That one is that, the if it weren't the
case, if we had more hypotheses and more examples. Perhaps an odd
number of them and we have the x's and the y's
in the right places then there'd be lots of ways to
get weak learners for all the distributions just because you'd have
more choices to choose from. What made this one particularly hard is
that you only had three hypotheses and none of
them was, not all of them were particularly good.
- Sure, yeah. I mean if you have a bun-, you can have many,
many hypotheses and they're all pretty bad
then you're not going to do very well.
- Well, it would depend upon if
they're bad on very different things. But you're
right, if you have a whole lot of hypotheses that are bad at everything, you're
going to have a very hard time with a weak learner. And if you have a
whole bunch of hypotheses that are good at
almost everything, then it's pretty easy to have
a weak learner.
- Interesting. Okay, this is more subtle
than I thought. So that's, that's interesting.
- Right. So what the lesson you should take away
from this is. If you were just, to think about it
for 2 seconds you might think okay weak learner. That
seems easy. And often it is, but if you think about
for 4 seconds you realize that's actually a pretty strong
condition. You're going to have to have a lot of hypotheses that
are, many of which are going to have to do good on
lots of different examples, or otherwise, you're not always going to be
able to find one that does well no matter what the
distribution is. So it's actually a fairly strong, and important condition.
Boosting In Code
- All right Michael, so here's boosting in pseudo code. Okay,
let me read it out to you then you can tell
me if it makes sense. So we're given a training set.
It's made up of a bunch of xi, yr pairs. You
know, x is your input and y is your output.
And for reasons that'll be clear in a moment All of
your labels are either minus one or plus one. Where minus
one means not in class or plus one means you're in
a class. So this is a binary classification task. That make sense?
- So far.
- Okay. And then what you're going to do is,
you're going to loop at every time step, let's call it
lower-case t. From the first time step one, to some big
time in the future. We'll just call it capital T and
not worry about where it comes from right now. The
first thing you're going to do is you're going to construct a distribution.
And I'll tell you how in a minute, Michael. Okay, so,
so, don't worry about it. And we'll just call that D
of T. So, this is your distribution over your
examples at some particular time T. Given that distribution, I
want you to find a weak classifier. I want your
weak learner to output some hypothesis. Let's call that eight
sub T. The hypothesis that gets produced to that
time step. And that hypothesis should have some small error.
Let's call that error Epsilon sub T, because it's a
small number. Such that it does well on the training
set, given the particular distribution. So, I'm just rewriting
my notion of error From, the other side of the
screen. So there are times we want you to find
a weak classifier. That is, we want you to call
some weak learner that returns some hypothesis. Lets call
it h of t that has a small error. Lets
call that epsilons of t. Which is to say ,that
the probability of it being wrong that is disagreeing with
the training label is small, with respect to the underlying distribution.
- So just to be clear there, the epsilon could
be as big as slightly less than a half. Right? It
doesn't have to be teeny, tiny. It could actually be,
almost a half. But it can't be bigger than a half.
- That's right. And, and no matter what happens.
Or even equal to a half. but, you know,
we can assume, although it doesn't matter for the
algorithm that the learner is going to return the best one
that it can. With some error. But regardless, it's
going to have, it's going to satisfy the requirements
of a weak learner, and all I've done is
copy this notion of error over to here. Ok?
- Ok. So, you're going to do that and you'll do that a whole bunch of times
steps, constantly finding hypothesis at each time step
[INAUDIBLE] with small error [INAUDIBLE] constantly making new. Distributions,
and then eventually, you're going to output some
final hypothesis. Which, I haven't told you yet how
you're going to to get the final hypothesis. But that's
the high level bit. Look at your training data,
construct distributions, find a week classifier with low error.
Keep doing that you have a bunch of them
and then combine them somehow into some final hypothesis.
And that's high level of algorithm for boosting, okay?
- Okay, but you've left out the two, two really
important things, even the part from how you find we,
weak classifier, which is where do we get this
distribution and where do we get this, this final hypothesis?
- Right, so let me do that for you right now.
The Most Important Parts
- Okay Michael, you've called my bluff. You, you said
I've left out the most important parts, and you
are right. So, I'm going to tell you ,how to
construct the most important parts. Let's start, with the distribution.
So, let's start with the base case, and that
is the distribution, at the beginning of time, D
sub one. So, this distribution, is going to be
over each of the examples and those examples, are ,indexed,
so, over i. And I'm simply going to
set that distribution, to be uniform. So, how
many examples do we have, Michael? Let's pick,
let's pick a letter. Let's call it n.
- Why not, cause we do that for everything else and I'm
going to say that for every single one of the examples they happen one
over int times, that is uniform distribution. Now, why do I do that,
because, I have no reason to believe, for the purposes of this algorithm,
that any given example, is better ,than any other example,
more important than any other example, harder than other example
or anything else. I know nothing. So, see if you
can learn over all of the examples. You with me?
- Yeah, cause I feel like if it
actually solves that problem, we're just done. So [CROSSTALK]
- [CROSSTALK] Yes and, and that's what you always want. But that's
the easy case. So I start out with uniform distribution, that's what you
usually do ,when you don't know anything. But, what are you going to
do ,while your in the middle? So, here's what I am going to
do Michael. At every time step T, I'm
going to construct, a new distribution, Dis of
T plus 1. Okay so, here's how we're
going to construct the distribution at every time step. Okay?
I'm going to create the new distribution, T plus 1 to be E for each example, I.
- to be the old distribution, and times T, times E to the minus alpha T, times Y
suvai, times H of sub-T, of X of I, all divided
by Z sub T. [LAUGH] So that's pretty obvious right? [LAUGH]
So what do each of those terms mean? I mean ,I
know it's intuitively obvious ,even to the most casual observer, but ,let
me just try to explain what each of the parts mean.
So, we know that the D is our distribution and it's some
number, where, over all the examples, it adds up to one.
And it's a stand-in, we know, because I said this at the
beginning, for how important a particular ,example is, how
often we expect to see it. And that's the
trick that we're using with distributions. So, I'm going to
take the old distribution for an example, of, for
a particular example. And I'm going to either make
it bigger or smaller, based upon, how well, the
current hypothesis, does, on that particular example. So, there's
a cute little trick here, we know that h
of t always returns, a value ,of either minus one or
plus one, because ,that's how we define our training set, you
always have a label of minus one or plus one. So,
ht is going to return minus one or plus one for a
particular x of i. Y of i which is the label
with respect to that, example ,is also always going to be plus
one or minus one. And alpha t is a constant, which
I will get into a little bit later just right now think
of it as a number. And so what
happens ,Michael, if the hypothesis ,at time t for
a particular example x of i agrees ,with the
label ,that is associated with that x of i?
- Well, hang on, you say the alpha's a
number. Is it a positive number? A between 0
and 1 number? A negative number? What kind of
number? Does, does it not matter. I think it matters.
- That's a good question. It, it
matters eventually. But right now, that number is
- Positive, alright. So, like, a [UNKNOWN]. Almost
like a learning rating kind of thing, maybe.
- It's a learning rating kind of thing, sort of.
- Alright, so, good. So the y, okay, I see, I see. So, the y
times the h is going to be. 1 if they agree, and minus 1 if they disagree.
- Exactly, so if they both say positive 1, positive 1 times positive 1 is 1.
If they both say negative 1, negative 1 times negative 1 is 1. So, it's exactly
what you say when they agree, that number is 1. And
when they disagree, that number is minus 1. Alpha Sub T,
which i define below, is always a positive number. You can
trust me on this. The error is always between 0 and
always gives you a positive number. And if you don't believe
me, you should play around with the numbers till you convince yourself.
So, that's going to be some positive number.
So, that means when they agree, that whole product
will be positive. Which means ,you'll be raising
e to minus some number when they disagree that
product will be negative which means you'll be
raising e to some positive number. So, let's imagine
they agree. So you're going to be re
raising, e ,to minus some number, what's going to
happen to the relative weight of d sub t of i?
When D agrees
- So, Michael wants us to do a quiz.
Because Michael likes quizzes cause he thinks you like
quizzes. And so, I want you to answer
this question before Michael gets a chance to. So
just to be clear here's the question again. What happens to the distribution
over a particular example i when the hypothesis ht that was output by the
example. Agrees with the particular label, Y-sub-i.
Okay, so we have four possibilities when they agree.
One is the probability of you seeing that particular
example increases. That is, you increase the value of
D-sub-t on i. Or the probability of you seeing that
example decreases. That is, the number d of t
of i goes down, or it stays the same
when they agree or well it depends on exactly
what's going on with the old value of d and
alpha and all these other things. So ,you can't
really give just one of those other three answers.
So those are your possibilites. The other radio buttons
[LAUGH] only one of them is right. And go.
When D agrees
- Okay Michael what's the answer.
- Alright, so you kind of were walking us through
it, but basically if Yi and Ht agree, that means
they're both negative or they're both positive. They're equal
to each other. So when we multiply them together we
get one. One times whatever our alpha thing is,
some positive number is going to be positive. We're negating
that, so it's negative E to the negative something is
something between zero and one. Less than, less than one.
So, that's going to scale it down. So, it looks
like. And you know assuming that everything else goes ok
with, with the way that ,uh, the normalization happens right?
It seems like it could be depends on the normalization.
- So by the way. That's a good point. The the x sub t. Is in fact, what ever
normalization constant you need at time T, in order
to make it all work out to be a distribution.
- Correct. Then not going to, not going to change.
- But is some of them are correct and some of them incorrect, the ones
that are correct are going to decrease. And the
ones that are incorrect are going to increase.
- That's right, so what's the answer to the quiz.
- That's true, it does. That's exactly the right answer.
It depends ,on what else is going on, you're correct. Now.
- But I feel like it should
be decreases, like that's really, mainly what happens.
- That's also fair. The answer is, if this one is correct,
that is they agree, and you disagree on at least some of
them, at least one, one other example, it will in fact decrease.
So I could ask a similar question, which is, well what happens
when they disagree? And at least one other example agrees. Then what happens?
- Yeah, then they, then that should increase. Oh.
- Oh. It's going to put more weight on the ones that it's getting wrong.
- Exactly. And the ones that it's getting
wrong must be the ones that are harder. Or at least that's the underlying
idea. All right, Michael, so you got
it? So you understand what the equation's for?
- Yeah, it look. It seemed really scary at first but it seems you know
marginally less scary now because all that it's
doing, it's doing it in a particular way.
I don't know why it's doing it in this particular way. But all it seems to
be doing is. The answers that it, it had, it was getting wrong... It puts more
weight on those and the ones that its
getting right, it puts less weight on those
and then you know, the loop goes around
again and it tries to make a new classifier.
- Right, and since the ones that its getting wrong are getting more and
more weight but we are guaranteed or atleast we've assumed that we have a weak
learner that will always do better than
chance. On ,ah, any distribution, it means that
you'll always be able to output some learner that can get some of the ones
that you were getting wrong, right.
- So that ties together this, what constructed E does
for you, and connecting it to the hardest examples.
So now, that gets us to a nice little
trick where we can talk about how we actually output
our final example. So, the way you construct your
final example, they way you do that combination in the
step is basically by doing a weighted average. And
the weight Is going to be based upon this alpha
sub T. So the final hypothesis is just the s g n function of the weighted sum of
all of the rules of thumb, all of the weak classifiers that you've been picking
up over all of these time steps Where
they're weighted by the alpha sub T's. And remember,
the alpha sub T is one half of the natural log of one minus epsilon T over
epsilon T. That is to say, it's a measure of how
well you're doing with respect to underlining error. So, you get more
weight if you do well Then if you do less well or
you get less weight. So what does this look like to you?
Well its a weighted average based on how well you're doing or
how well each of the individual hypotheses are doing and then you
pass it through a thresh holding function where if its below zero
you say you know what? Negative and if its above zero you
say you know what? Positive and if its zero you just throw
up your hands and And return zero. In other words, you return literally
the sign of the number. So you are throwing away information there, and
I'm not going to tell you what it is now, but when we go
to the next lesson it;s going to turn out that that little bit of
information you throw away is actually
pretty important. But that's just a little
bit of a teaser. We'll get back to that there. Okay so, this
is boosting, Michael. There's really nothing else to it. You have a very
simple algorithm, which can be written down in a couple
of lines. The hardest parts are constructing the distribution, which I
show you how to do over here, and then simply bringing
everything together, which I show you how to do over here.
- Alright yeah, I think it doesn't seem so bad
and I feel like I could code this up, but I
would be a little happier if I had a handle
on what the, why alpha is the way that it is.
- Well there's two answers. The first answer
is. You use natural logs because you're using
exponentials and that's always a cute thing to
do. And of course, you're using the error term
as a way of measuring how good the hypothesis
is. And the second answer is, it's in the
reading you were supposed to have done. [LAUGH] So,
go back and read the paper now that you've
listened to this and you will have a much
better understanding of what it's trying to tell you.
- You're welcome. I'm about helping others Michael you know that.
Three Little Boxes
- So, Michael, I want to try to convince
you other than the fact that it's an algorithm
with symbols that, it sort of works, at, at
least informally. And then, I'm going to do what I
always do and refer you to actually read
the, the, the text to get the, the details.
But before I do that, I wanted to go
through an example if you think that would help.
- I would like an example.
- Okay. So, let's go through an example. So, here's a very simple
example. So, I got three little boxes on the screen. Can you see them?
they're the same boxes. I've drawn them up here beforehand
because I'm going to solve this problem in only three steps.
- Hey those boxes are really nice, did
you get help from our trusty course developer?
- I did in fact did get help from our trusty
course developer. And when I say help, I mean he did this.
- Oh thanks Push Car.
- Yes Push Car is wonderful. Now what's
really cool about this is that Push Car is
already let you know that we're going to be
able to do this in 3 simple steps. And
I'm going to be able to animate it. Or at least hopefully
it'll look animated by the time, [LAUGH] we're done with all
the editing. So just pay attention to the first box for
now, you have a bunch of data points; red pluses and
green minuses, which is the opposite of what we usually do
Push Car. But either way it's red pluses and green minuses.
[LAUGH] With the appropriate labels and they all live in this,
this part of the plane. By the way, what do you call
a part of the plane? I know you have
line segments, what's like, a sub part of a plane?
- Looks like a square to me.
- Yes it is, but I mean, what do you call them? You,
you don't call it a plane segment, do you? What do you call it?
- A region.
- A square region, fine. So it's a square region on a plane. And we
want to figure out how to be able to correctly classify these examples. Okay, so
that is nothing new there. We just want to be able to come up with
something. So now we have to do what we did like in the quiz is that
we have to specify what our hypothesis space is. So
here's our hypothesis space. So the hypothesis space is the set
Of axis aligned semi-planes. You know what that means?
- Mm, no.
- Well for the purpose of this example it means,
I'm going to draw a line, either horizontal or vertical and say
that everything on one side of that line is positive
and everything on the other side of that line is negative.
- I see. Okay, good.
- Right. And their axes align because it's only
horizontal and vertical, and they're semi-planes because the positive part
of it is only in part of the plane. Okay, so I'm
going to just walk through what boosting would end up doing with this particular
example or what a boosting might do with this particular example given that
you have a learner. That always chooses between axis aligned semi planes. Okay?
- So let's imagine we ran our boosting algorithm
now in the beginning it's step 1 all of the
examples look the same because we have no particular reason
to say any are more important than the other, any
are easier or harder than the other. And that's just
the algorithm we had before We run through and we ask
our learner to return some hypotheses that does well in
classifying the examples. It turns out that though there are many,
and in fact there are an infinite number of possible
hypotheses you could return. One that works really well is one
that looks like a vertical line that separates
the first two data points from the rest.
- That is what I was guessing.
- Of course it was. And what I'm saying here is that everythign to
the left of this line is going to be positive and everything to the right
is going to be negative. So if you look at this what does this hypothesis do? So
it gets correct, correctly labeled positive. The two pluses to the left. Right?
- And it gets correct all of the minuses as well.
- Right? But it gets wrong the three pluses on the
right side. So it gets, this wrong, this wrong, and this wrong.
- Right, the Three Plusketeers.
- Exactly. [LAUGH] The Three Plusketeers. That's
actually pretty good. So I'm just you know
I'm just going to ask you to trust me here but it turns out that the specific
error here is 0.3 and if you stick that into our little alpha you end up,
our little, our little formula for alpha, you end up with alpha equal to 4.2.
- That's not obvious to me but.
- Is is See, see, see it's not always obvious.
- Okay. Good. So there you go and that's
just what happens when you stick this particular set in
there. So now we're going to construct the next distribution.
Right? And what's going to happen in the next distribution?
- So the one's that it got right should get less weight and the one's that
it got wrong should get more weight so
those three plusketeers should become more prominent somehow.
- That's exactly what happens. They become, I'm just going to draw them
as much thicker and bigger to kind of emphasise that they're getting
bigger, and it's going to turn out that everything else is going
to get smaller which is a lot harder to draw here. So i'm
just going to kind of leave them their
size, so they sort of get normalized away. Okay?
- I would guess as to what the next plane should be. I think that we should
cut it. Underneath those pluses but above the green
minuses. And that should get us three errors. The
two pluses on the left and the minus on
the top will be wrong. But they have less
weight than the three pluses we got right, so
this going to be better than the previous one.
that's possibly true. But it's not what the learner output.
- Let me tell you what the learner did output though. This learner
output by putting a line to the right of the three pluses, because
he's gotta get those right in saying that everything to the left is
in fact, positive. So, does that seem like a reasonable one to you?
- Well, it does better than half. I guess that's really all what we're
trying to do, but it does seem to do worse than what I suggested.
- Well, let's see, it gets the three
that mattered that you were really, really doing poorly
right but then so did yours. And it only,
and it picks up still the other two which
it was getting right. And it gets wrong these
three minus' which aren't worth so much. So is
that worse than what you suggested? No, it gets
wrong, oh, the three minuses. Oh, it gets correct
those two red pluses on the left. So it gets three things
wrong. So that's just as good as what I suggested. Okay, I agree.
- Okay good. So the area of this step by the way, turns out
to be 0.21 and the alpha at this [INAUDIBLE] step turns out to be 0.65.
So that's pretty interesting, so we got a bunch of these right and a bunch of
these wrong. So what's going to happen to the distribution over these examples.
- Alright, the ones that it, again, the
ones that it got wrong should get pushed up
in weight and which ones are those, those are
the, the three green minuses in the middle patch
- They should become more prominent. The pluses, the three, the three
plusketeers should become less prominent than they were but it still might be
more prominent than they were in the beginning. And maybe because in fact
the alpha, let's see the alpha is bigger so, it will have actually
a bigger effect on bringing it down.
- Yeah I guess so, but it, there'll still be
more prominent than the other ones that haven't been bumped.
- Yeah the ones that you, the, the two, the two
red pluses on the left have, you've never gotten them wrong.
- So they're really going to disappear. So, if we do, If I
do my best to, If I do my best to kind of draw
that you're still, you're going to have. These pluses are going to be a
little bit bigger than the other pluses, but they're going to be smaller than
they were before. The two, the three greens in the middle are
going to be bigger than they were before. But those two pluses
are going to be even smaller, and these two minuses are going to
be smaller. So, what do you think the third hypothesis should be.
- Oh, I like that.
- Okay, so Michael wanted to have a quiz here,
'because Michael again, likes those sort of things and, and
I like to please Michael. So, we came up
with three possibilities, one of which we hope is right.
- And I've labeled them here in orange, A, B,
and C and put little radio boxes next to 'em, so
you could select 'em. So which of those three hypotheses
are, is a good one to pick next? So, A is
a horizontal line that says everything above it should be a plus. B is a,
another horizontal line that says everything above it should be a plus. And C is
a vertical line, like the last two
hypotheses that we found, that says everything to
the left should be a plus. So, which do you think is the right one? Go.
- Alright Michael, what's the answer?
- Alright so of those others, well C
is pretty good, because it does separate the
pluses from the minuses. We, we even liked it so much we used it in round two.
- But it doesn't as good to me as A, because A actually does a good
job of separating the very, the more heavily
weighted points. So I would, I would say A.
- So in fact that is what our little learning system shows.
It shows A. Now, through the trick of animation, I
leave you with A. And that is exactly the right
answer. By the way, Michael, if you look at these
three hypothesis and their weights, you end up with something kind
of interesting. So if you look at this third hypothesis
that's chosen here, turns out they have a very low
error, you'll notice that the errors are going down over
time, by the way, of 0.14. And it has a much
higher alpha of 0.92. Now if you look at these weights and you add them up, you
end up with a cute little combination. So,
let me draw that for you. Okay Michael, so
I cleaned up a little bit so that you could see it. If you take each of
the three hypothesis that we produced, and you weight
them accordingly, you end up with the bottom figure.
- No way.
- That's. Kind of awesome. So what you're saying
is that, even though we were only using half planes,
or, or axis-aligned semi planes, for all the weak learners, that
at the end of the day it actually kind of bent
the line around and captured the positive and negative examples perfectly.
- Right. Does that remind you of
anything else we've talked about in the past?
- Sh. Everything. Nothing. No, I dunno, I mean so with,
with decision trees you can make the shapes like that, and
- That's true.
- And the fact that we're doing a weighted
combination of things reminds me of the neural net.
- Yeah. And it should remind you of one other thing.
- I'm imagining that you want me to say
nearest neighbors, but I can't quite make the connection.
- Well, you recall in our discussion with nearest neighbors,
when we did weighted nearest neighbor. In particular we did weighted
linear regression, we were able to take a simple hypothesis,
add it together in order to get a more complicated hypothesis.
- That's true, because it's local.
- Right, exactly because it's local, and
this is a general feature of Ensemble methods that if you try to look
at just some particular hypothesis class. Let's
just call it H, because you're doing weighted
averages over hypotheses drawn from that hypothesis
class. This hypothesis class is almost all
low, is at least as complicated as
this hypothesis class and often is more complicated.
So you're able to be more expressive, even though you're
using simple hypotheses, because you're combining them in some way.
- I'm not surprised that you can combine simple things to get
complicated things. But I am surprised that you can combine them just
with sums. And get complicated things because sums often act very, you
know, sort of, friendly. Right it's
a linear combination not a nonlinear combination.
- Actually, Michael part of the reason you get something
nonlinear here is because you're passing it through a non-linearity
at the end.
- The sine.
- Yea, that's a good thing, we should, we should ponder that.
- Okay Michael, so we've done our little
example. I want to ask you a quick question
and try to talk something through with you
and then we can start to wrap up. Okay.
- Alright, so, here is my quick question. Now, in the reading, which I know
you've read, there's a proof. That shows that boosting not
only, you know, does pretty things with axis of line semi-planes, but also
that it will converge to good answers and that it will
find good combined hypotheses. You know, we could go look at
the reading and write down a proof that shows that boosting
does well. Umm. And there's one in the reading. Or we
could talk about an intuition. So if someone were to come
up to you. If a student were to find you somewhere
and said, I read the proof, I'm kind of getting it,
but do you have a good sort of intuition about why
boosting pins will do well? What do you think you would tell them?
Could you think of something simple? I've been struggling with this for a while.
- No. [LAUGH].
- Okay, well, then let me try something on you and you can tell
me if it sort of makes sense. So this is just an intuition for why,
for why boosting pins will do well. Okay, so what does boosting do? Okay.
Boosting basically says, if I have some
examples that I haven't been able to classify
well, I'm going to re-rate all my examples. So that the ones
I don't do well become increasingly
important. Right, that's what boosting does. Yes?
- Right, that's what this whole, whole bit of D is all about. It's all about
re-weighting based on difficulty and hardest. And we know that
we have the notion of a weak learner. That no matter what happens for
whatever distribution, we're always going to be able to find
some hypothesis that does well. So, if I'm trying to
understand why boosting in the end, why the final hypothesis
that I get at the end, is going to do well.
I can try to get a feeling for that by
asking, well. Under what circumstances would it not do well? So,
if it doesn't do well, then that means there has
to be a bunch of examples that it's getting wrong, right?
- Mm hm.
- That's what it would mean not to do well, agreed?
- Okay. So how many things could it not get right? How many things could it
misclassify? How many things could it get incorrect?
Well, I'm going to argue Michael, that, that
number has to be small. There cannot be a lot of examples that it gets wrong.
So do you want to know why? Do you want to know my reasoning for why?
- So, here's my reasoning, let's imagine I had a
number of examples at the end of this whole process.
I've done it T times. I've gone through this many times
and I have some number of examples that I'm getting
wrong. If I were getting those examples wrong, then I was
getting them wrong in the last time step, right? And,
since I have a distribution and I re-normalize, and it has
to be the case that at least half of the time,
more than half of the time I am correct, that number
of things I'm getting wrong has to be getting smaller over
time. Because let's imagine that was at a stage where I
had a whole bunch of them wrong. Well, then I would
naturally renormalize them with a distribution so that all of those things
are important. But if they were all important, the ones that
I was getting wrong, the next time I run a learner, I
am going to have to get at least half of them
right, more than half of them are right. Is that make sense?
- It does, but it, but what scares me is, okay, why can't
it just be the case that the previous ones which were getting right
start to get more wrong as we shift our energy towards the errors.
- Yeah, why is that?
- I don't know. But did you wanna,are we, we working up to some kind of you
know, log any kind of thing where each time
you are knocking off half of them and therefore.
- I don't know. Do you remember the proof.
- The proof.
- I mean what goes on is that
you get, sort of, this exponentially aggressive weighting over
- And you're driving down the number of things you get wrong. Sort of
exponentially quickly, over time. That's why boosting
works so well and works so fast.
- I get that we're, the we're quickly ramping up the weights on the hard
ones. I don't get why that's causing us to get fewer things wrong over time.
So like, when you should, in your, in your example that you worked through, that
had the error in the alphas and the errors kept going down and the alphas
kept going up.
- Like, is that necessarily the case?
- Well, what would be the circumstances under
which it isn't the case? How would you
ever go back and forth between examples? Well,
certainly it's the case that if you keep getting
something, right, you will, get them. Well, so
here's what happens over time, right. Is that over
time, every new hypothesis, it gets to get
a vote, based upon how well it does on
the last, difficult let's say, distribution. So the
ones that are even if the ones that
you were getting right you start to get
wrong, they are going to, if you get them
increasingly wrong, that error's going to go down and
you're going to get less of a vote, right.
Because e sub T is over the current
distribution. And it's not over the sum of the
voted, over all the examples you've ever seen.
- So does that make sense? Is that right?
- I don't know. I don't have the intuition, it seems like
it could be, you know, could we keep shifting the distribution. It
could be that the error is going up. Like if the air
could be low, why can't we just make it low from the beginning.
- Like, I feel like the arrow should be going up,
because we're, we're asking it harder and harder questions as we go.
- No, no, no, because we're asking it harder and
harder questions, but even though we're
asking it harder and harder questions, it's
forced to be able to do well on those hard questions. It's forced
to, because it's a weak learner. I mean that's why having, being able
to always, that's why having a weak learner is such a powerful thing.
- But why couldn't we like on, on
iteration 17, have something where the weak learner works
right at the edge of it's abilities, and
it just comes back with something that's a half
- That's fine. But it has to always be able to do that. If it's
a half minus epsilon, the things it's getting
wrong will have to go back down again.
- No, no I understand that. What I'm saying
is that, why would the error go down each iteration.
- Well, it doesn't have to, but it shouldn't be getting bigger.
- Why shouldn't it be getting bigger?
- So, imagine, imagine, imagine the case that you're
getting, right. You, you are working at the edge of
your abilities. You get half of them right. Roughly and
half of them wrong, the ones you got wrong would
become more important, so the next time round you're going
to get those right, versus the other ones. So you could
cycle back and forth I suppose, in the worst case,
but then you're just going to be sitting around, always having
a little bit more information. So your error will not
get worse, you'll just have different ones that are able to
vote on that do well on different parts of the space.
Right? Because you're always forced to do better than chance. So.
- Yeah but that, that's not the same as saying
that we're forced to get better and better each iteration.
- That's right, it's not.
- So it's, yeah again, I don't see that, that property just falling out.
- Well, I don't see it falling out either, but then
I haven't read the proof in like seven, eight, nine years.
- Well, I feel like it should be, it should be something like, look we
had, look at what the, so okay, so we generate a new distribution, what is
the previous, what's the previous classified error
on this distribution, it better be the case.
I mean if it were the case that we always return the best classifier that
I could imagine trying to use that but.
- Well we, well we don't, we don't require that.
- Yeah, I mean, it's just finding one
that's epsilon minus, or a half minus epsilon.
- Right, so let's, let's see if we can take the simple case,
we got three examples, right, and you're bouncing back and forth and you want
to construct something so that you always do well on two of them.
And then poorly on one, kind of a thing, and that you keep bouncing
back and forth. So let's imagine that you have one-third, one-third, one-third,
and your first thing gets the first two right and the last one
wrong. So you have an error of a third. And you make that
last one more likely and the other two less likely. Suitably normalized, right?
- So now, your next one, you want to somehow
bounce back and have it decide that it can miss,
so lets say you missed the third one. So you,
you get the third one right. You get the second
one right but you get the first one wrong. What's going
to happen? Well, three is going to go down. You're still going to,
well you won't have a third error actually. You'll have less than
a third error because you had to get one of the ones
you were getting right wrong, you had to get the one
you were getting wrong right. So your error is going to be
at least an example I just gave. Less than a third. So,
if your error is less and a third, then the weighting goes
up more. And so, the one that you just got wrong
goes up, doesn't go back to where it was before. It
becomes even more important than it was when you had a
uniform distribution. So the next time around, you have to get
that one right, but it's not enough to break a half.
So you're going to have to get something else right as well,
and the one in the middle that you were getting right
isn't enough. So you'll have to get number three right as well.
- Right? And so, it's really hard to cycle
back and forth between different examples, because you're exponentially
weighting how important they are. Which means, you're always
going to have to pick up something along the way.
Because the ones that you, coicidentally, got right two
times in a row. Become so unimportant. It doesn't
help you to get those right. Whereas, the ones
that you've gotten wrong, in the past. You've got to,
on these cycles. Pick up some of them in order to get you over a half.
- And so, it is very difficult for you to cycle back and forth.
- And that kind of makes sense, right? If
you think about it in kind of an information gain
sense, because what's going on there is you're, you're
basically saying you must pick up information all the time.
- Hm. And then your non uni. Well uniform is
the wrong word but you are kind of. You know,
non-linearly using that information in some way. So that kind of
works. It makes some sense to me, but I think that
in the end what has to happen is you. You, there
must be just a few examples in a kind of weighted sense
that you're getting wrong. And so if I'm right, that as
you, as you move through each of these cycles, you're weighting
in such a way that you have to be picking up
things you've gotten wrong in the past. So in other words,
it's not enough to say, only the things that are
hard in the last set, are the ones that I
have to do better. You must also be picking up
some of the things that you've gotten wrong earlier more
than you were getting right. Because there's just not enough
information in the one's that you're getting right all the
time, because by the time you get that far along,
the weight on them is near zero and they don't matter.
- And then if you say, well, Charles,
I could cycle back by always getting those wrong,
yes, but then if you're getting those wrong, they're going to
pull up and you're going to have to start getting
those right too. And so, over time, you've gotta not
just pick out things that do better than a
half. But things that do well on a lot of
the data. Because there's no way for all of the
possible distributions for you to do better than chance otherwise.
- Okay, Michael, so that was a great conversation, what have we learned?
- Alright, well we talked about ensemble learning
which was the idea of instead of just
learning one thing, if it's good to learn
once, it's even better to learn multiple times,
- In multiple ways.
- The simple version that we concentrated on first was
this notion of bagging. Where what we did is instead
of just learning on the whole data set, we would
sub-sample bunch of examples from the training set, different ways,
and train up different classifiers or different learners on each
of those and then merge them together with the average.
- Okay, so if I can summarize
that, we learned that ensembles are good. [LAUGH]
- We learned that even simple ensembles like bagging are good.
- We talked about the fact that by using this kind
of ensemble approach, you can take simple learners or simple classifiers and
merge them together and get more complicated classifiers.
- Mm, yeah, so we can take. We
can. Combining simple gives you complex. Anything else?
- And we talked about the idea of boosting where you
can Oh, maybe this is why it's called boosting. You can take
something that has possibly very high error but always less than
a half, and turn it into something that has very low error.
- So we learned that boosting is really good.
And, we talked a little bit about why, that's
good. By the way, there's a whole bunch of
other details here too, right? Boosting also has the
advan, as does bagging Not only has these little
properties you've talked about before, but it tends to be
very fast. It's agnostic to the learner. As you
noticed, that in no time, did we say, try
to take advantage of what the actual learner was
doing. Just that it was, in fact, a weak learner.
- So I think that's important. It's agnostic.
- Meaning you can plug in any learner you want?
- Yeah. So long as it's a weak learner. So there's something
we learned about. We learned about weak learners that we defined with that
meant. And, we also talked about ,um, what error really, really means.
With respect to some kind of underlying distribution. What do you think Michael?
- That seems like useful stuff.
- These are useful stuff to me. I'm going to throw one more
thing at you, Michael, before I let you go. Okay, you ready?
- Here's a simple fact. About boosting that
turns out in practice. You know our
favorite little over-fitting example. Do you know
how over-fitting works? You have a training
line that tends to get better, and better,
and better. Maybe even going down to zero error. But then you have test
error Which gets better and better and at some point it starts to get worse.
- And at that point you have over fitting and I think,
Michael, you asserted it at some point or maybe I asserted that
,you always have to worry about over fitting. Over fitting is
just the kind of fact of life. You got to come
up with ways to deal with it or sort of over
believing your data. Well, what if I told you that in practice
When you run boosting, even as you run it over time
so that your training error keeps getting better and better and
better and better, it also turns out that your testing error
keeps getting better and better and better and better and better and
better and better.
- That seems too good to be true.
- It does seem too good to be true. It turns out it's
not too good to be true. And I have an explanation for it.
- Tell me.
- Not until next time.
- alright, see you then.
- See you then. Bye.