Hi again Michael.
- Hey, how's it going?
- It's going pretty well. We are ready to do our
last lesson of this mini course
on unsupervised learning and randomized optimization.
- Cool, what's it about?
- It is about feature transformation. As opposed to feature selection.
- Cool. So they can change from vehicles into robots, I assume.
- Yeah, typically. Usually it's from robots
into cars, but that's a technical detail.
- Okay. So I'm going to define feature transformation for you, or at
least I'm going to do my best to. It turns out that it's
a slightly more slippery definition than the one that we used for
feature selection but it has a lot of things in common. Okay?
- And you tell me if this makes sense.
Okay, so feature transformation as opposed to feature selection which
we talked about last time is the problem of
doing some kind of pre-processing on a set of features
in order to create a new set of features. Now typically, we expect that new set of
features to be smaller or in some way
more compact. Okay? But while we create this new
set of features, we want to explicitly retain as
much information as possible, and when I say retain
as much information as possible, what I probably mean, and I
think we'll see as we continue this conversation,
is information that is relevant and hopefully useful.
Now, the first thing you might ask is,
what's the difference between this and feature selection.
Is that a question you might ask, Michael?
- I was thinking about that, though I think you kind
of told me at the beginning of the feature selection lecture.
- Well, I told you that I was going
to tell you. I'm not sure I actually told you.
- So one thing you might say Michael, is that if I
looked at this definition, this seems to actually be consistent with feature
selection as well. In feature selection, I'd take a set
of features and then create
a new feature set which happens to be smaller. And my goal was to retain
as much information as possible and we
talked about the difference between relevant features and
useful features, but really this sort of
describes feature selection as well. You see that?
- Yeah, definitely.
- Now the difference is, in fact, probably the right way to
think about it is this: that feature selection is in fact a subset
of feature transformation. Where the pre-processing you're doing is
literally taking a subset of the features. Here, feature
transformation can be more powerful, and can be an
arbitrary pre-processing, not just something that goes from a set
to a subset. But what we're going to do
is we're going to restrict our notion of feature transformation to
what's called linear feature transformation. So, let me define
that for you explicitly. So as before with the feature
subset, we said that we were taking a
bunch of features, or a bunch of instances, that
were in some N-dimensional feature space and transforming
them into another feature space of size M.
- Yup.
- And in the feature selection problem, M was meant
to be less than N, hopefully much less than N. And that's
typically the case for feature transformation too, though as we'll discuss towards
the very end, it doesn't have to be. But when I say
usually we expect M to be less than N, usually means almost always.
- [LAUGH] Okay. And that's because of
the curse of dimensionality problem. But the difference
between what we were doing with feature selection
and feature transformation, because this is exactly what
I wrote before, is that this transformation operator
is usually something like this, a linear transformation
operator. So the goal here is to find some matrix P, such that I can project
my examples, my points in my instance
space, into this new subspace, typically smaller
than the original subspace. And I'll get some set of new features which are
combinations, in particular linear combinations, of the old features. Does that
make sense?
- I think so. So then that P matrix would want to be N by M.
- So the transpose would be M by N, and
that multiplies the N-dimensional feature vector x
and projects it out to an M-dimensional feature space.
- Okay. Yeah. Pretty good. So if we wanted to write that
out, we could say that feature selection was about taking features like
X1, X2, X3 and X4 and finding a subset like X1 and X2.
And that would be feature selection. But feature transformation would be taking
something like X1, X2, X3, and X4, and translating them
into something like 2X1 plus X2, which creates
a single new feature, which is a linear combination
of a subset of the original features. Does that make sense?
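(To make the P-transpose-times-x notation concrete, here is a minimal numpy sketch; it is not from the lecture, and the instance and matrices are made up. It shows feature selection as one special case of the linear map, and the 2X1 + X2 combination as another.)

```python
import numpy as np

# One instance with N = 4 original features x1..x4 (made-up values).
x = np.array([3.0, 5.0, 1.0, 7.0])

# Feature selection is the special case where P just picks out features.
P_select = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0],
                     [0.0, 0.0]])   # N x M with M = 2: keep x1 and x2
print(P_select.T @ x)               # -> [3. 5.]

# A more general linear feature transformation: one new feature, 2*x1 + x2.
P_combine = np.array([[2.0],
                      [1.0],
                      [0.0],
                      [0.0]])       # N x M with M = 1
print(P_combine.T @ x)              # -> [11.]
```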
- Yeah, but it could actually make other feature
combinations as well, right. It's just projected down to one.
- Right, it could be projected down to two, or it could be projected
down to three, or you could even project it out to a different four.
- And in principle you could imagine that you could even project
up into other dimensions which is
something that we've done before. Conceptually, anyway.
- When we talked about kernels?
- Yeah, although those were typically non-linear transformations; implicit
in the notion was doing a non-linear feature transformation.
But we did it even before we learned
about kernels. The very second thing I think we did: perceptrons.
How do perceptrons go into a higher dimensional space?
- Well, when we talked about XOR.
- Oh, right.
- What we effectively did was we showed that we could
project the original two dimensional space into what looks like a three
dimensional space where the third dimension was a combination of the first
two. And then you could actually do it with a linear separator.
- But that wasn't a linear
transformation, that was a nonlinear transformation.
- Right. Because we were talking about Boolean variables.
- But, at the end of the day it
was still a kind of feature transformation, and in this
case a feature transformation where we went to a higher
number of features. And today, what we're going to be talking
about is linear transformations as opposed to non-linear transformations.
And we're going to be focusing specifically on the case
where we're trying to reduce the number of dimensions. So
the implicit assumption here, right, the kind of domain knowledge
that we're bringing here, or the belief that we're bringing here, is that we
don't need all N features. We need a much smaller set of features that'll still
capture all the original information, and therefore,
help us to overcome the curse of dimensionality.
It's just that that smaller subset might require
bringing in information across the various features. Okay?
Okay, Michael, so you might ask me why we would
do feature transformation, and I think the answer is obvious
even to the most casual observer. We actually just went
through a couple of examples in the previous slide, like XOR
or the kernel methods, all these things we've done before
where effectively we were doing feature transformation. In fact, you
could argue that things like neural networks are already doing feature
transformation. So we've been doing feature transformation all along, and so,
the real question here is, is there some specific
reason why we should do this feature transformation explicitly and
separate from some specific learning algorithm that is doing its
own feature transformation such as a perceptron or a neural
network, or support vector machines. So I'm just going to very
quickly go through a simple example that I hope illustrates
why you might want to do this, but it's still
kind of a large, relevant example, okay. You with me?
- I'm ready.
So here's the problem I'm going to describe, and I'm not going to do it
in a lot of detail. I really just want to give you an
intuition. That problem is information retrieval.
So in particular, I'm talking about what's
called ad hoc retrieval. Do you know what the ad hoc retrieval problem is?
- I assume that if you start with a
key, and you add hoc to it, you get hockey.
- No, that would be a kind of feature transformation.
We would call that the pun transformation.
The ad hoc retrieval problem, maybe I can
help you out by calling it by another
nickname, it's the Google problem [LAUGH] which is,
you have a whole bunch of documents in
some database somewhere. This is me drawing a
database and you want to retrieve exactly the subset
of documents that are relevant to some query.
So this is me drawing a database. So I
might type in a word like machine-learning, into some text
box somewhere, click on some button, and then get back
a list of relevant documents nicely sorted against the white background.
- [LAUGH] And this is the ad hoc query problem, the ad hoc retrieval
problem. I give you some features, which in this case are words, and you return
objects, in this case documents, that are somehow
relevant to those features. Okay? So we do this all the time.
- What makes it ad hoc? As opposed
to what? There's various other kinds of information retrieval problems;
this one's just called the ad hoc one. What's ad
hoc about it is you don't already know what the
queries are. You aren't doing something like clustering the
documents. It's just that you're going to be given some unknown
query, which is in some feature space, and you're just
going to have to retrieve the things that are relevant.
It's just always been called the ad hoc problem as far as I
know, I guess because you don't know what these are going to be beforehand.
- Yeah, and like as opposed to maybe some kind of standing query, like
I want any information that talks about
the OMS program, you know, please alert me.
- Right, exactly, and that's actually got
a name too, it's not the standing query
problem, but it's a kind of triage problem,
I can't remember what it's called right now.
But yeah, so there's different versions of this. This one is in some
ways the hardest, because you can't do a lot of very clever things up
front, in order to figure out the best way to store your documents,
because you don't know what this ad hoc query is going to be. Okay?
- All right. So. Why is this problem
hard, other than by saying I don't know what
this is going to be? So let's think about
it in terms of the features that you have.
What Are Our Features
Okay Michael, so what are our features? When we
have a bunch of documents let's say full of words?
- I'm not sure, it could be things like the number of words in the document?
- That could be, and that's actually a perfectly reasonable thing, but what's the
simplest set of features that you might start with,
from which you might then compute more interesting features?
- Well there's a super, duper big set of features which are the words.
- Right, so in
fact that is exactly what we have, is we have words.
Maybe we have punctuation and, you know. Words are sort of the
obvious thing to do. I typed in machine learning, why don't
I just return every single document that has machine followed by learning.
- Maybe you should.
- Maybe I should and maybe that's a
perfectly reasonable thing to do. In the case of
machine learning, should I return documents that contain
the word Machine but don't contain the word learning?
- I wouldn't score them very highly.
- I might not score them as highly, but I might still return
them as being at least more relevant
than documents that contain neither Machine nor Learning.
- Right, that are just about turtles, say.
- Right exactly, although it's turtles all the way down Michael.
- I see.
- OK. So if our features are words, which are sort of the most
obvious things to get at, we can compute
other things like the number of words or
whatever. But basically our features are going to be
words or counts of words or something like
that, as a sort of reasonable first
step. And, in fact, the very early retrieval
systems, which, you know, predate both of us, actually Michael,
used simple things like words. Now, there are a lot of
details to this you could imagine, like maybe you'd
want to transform all plurals into their singular versions, or get
rid of words like the. And there's a bunch of
complicated stuff you might want to do. But it's not particularly
relevant for this discussion. So, just assume whatever you want
to assume about the kind of words that you have. Okay?
- Okay. So, what's the problem
with using words? Can you think of any problems with using words?
- Well, there's a lot of them.
- Right. That's actually the first thing.
There's a lot of words. Which means there
are a lot of features. Which means
the curse of dimensionality is going to hurt us.
- I would think that they'd be
pretty good indicators of meaning, except I guess
there's kind of two complementary problems. One is
that some words mean more than one thing.
- Mm. I like that. So we have good indicators.
Words are good indicators of meaning, but you could be in the
case where you have words that mean multiple things. You
said there were two problems; what's the second one?
- Sort of the opposite, which is, you
can say the same thing using completely different words.
- Yes. So that's that many words can mean the same thing. In
particular, the fact that a word
can have multiple meanings is called polysemy,
and the fact that many words can mean the same thing is called synonymy. So, can
you think of a word that might have
multiple meanings, that illustrates the problem of polysemy?
- Well, so, you know, learning, in the machine learning example:
learning sometimes refers to this statistical process that we've been talking
about. But it also refers to the thing that people do.
And in some scenarios, it actually means to teach. Like,
I'm going to learn you something.
- That's true, but that's just too on point.
I'm going to pick something else, that I think you
will appreciate at the end. Let's think of a
word like car. So car has multiple meanings, Michael.
- Can you think of them?
- I'm mostly stuck on one at the moment. Unless you're thinking of...
- Which one is that?
- I'm thinking of the vehicle that you drive around.
- But I guess one could also be referring to
a list deconstructor in Lisp.
- Yes, exactly. It's the first element in, what are those things called?
- Cons, cons cells?
- Cons. Thank you. Right, so car is
either an automobile or it's the first element
in a cons cell, which all of you people who've heard of Lisp know is an awesome
reference. [LAUGH] For those of you who've never
used Lisp, which is clearly the greatest language ever
written in all of history, or that ever
will be written. Because it is a superset and
subsumes all other languages, including natural languages. Imagine this
word, instead, is apple. And so apple can refer
to a fruit, or it can refer to a
computer company, or to a music company for that matter.
- But, I prefer car. So, car can mean
automobile, or it can mean the first element in
a cons cell, if you're using Lisp. Now, sticking
to this car example, synonymy is a similar problem,
in that car and automobile often refer to the same thing. So
you see this: polysemy is a problem of one word meaning multiple things, and synonymy
is a problem of many words meaning the same thing. And a particular word, like
car in this case, can cause both polysemy problems and synonymy problems.
- Yeah, I can see that. So in the
example you gave before about machine learning, you would,
if you'd just returned documents that had machine and
learning right after each other, then you'd miss all
the stuff on, for example, data mining.
- Right, in fact, that's
a very good example. So, there's a huge split in the community
between those who care about data mining and those who care about
machine learning. And often the people in one camp don't believe the
people in the other camp are doing what they're doing. But for
the person who isn't religiously committed to one of those camps or
the other, when you talk about machine learning you probably also care
about data mining. When you talk about data mining, you probably also
care about what people inside the field might call it instead
of machine learning. And so you would be missing out on
a whole swath of papers or interesting discussions if you happened
to put in the words machine learning. But similarly, if you put
in a term like data mining, you might end up getting
all kinds of documents that are about data, or about the kind of
mining you do when you literally mine for coal. Who
knows? And it would cause you huge problems for getting exactly the
relevant set of documents that you want. So we
can actually talk about this in terms of, the
kind of errors that you get in say supervised
learning. What kind of errors does polysemy give you, Michael?
- Well, there's false positives and false negatives, and
this is one of those things where it's going to return
things that aren't relevant; it's going to say they're positive when
they're not. So you've given me the answer, false positive. Okay.
- Looks like false positive.
By contrast, synonymy gives you what?
- True positives.
- No wait, I negated the wrong word, false negatives.
- In other words it's going to, wait, is that right? So false
negative means, oh, it's going to tell you
something's not there, when actually it really is.
- Right. So typing in car will not just give me the automobile
articles about my Tesla, which is a black P85, with black leather,
[LAUGH] fully loaded, and is like the greatest car on
the planet, but it will also give me really awesome, but
in this case not what I'm looking for, documents about LISP.
Meanwhile, when I type in car, I will actually get all
these things on automobiles. But I won't get articles that
talk about automobiles without actually using the word car. So you
can imagine that these sorts of problems come up all the
time. You've got a set of features, in this case words,
which have this problem that, although they're good indicators, they
are insufficient. That is, we have a set of features
that will generate false positives and false negatives on their
own, and more to the point, doing feature selection will
not solve this problem. I can throw away a bunch
of irrelevant words and even useless words, but I am
still going to have this problem of generating false positives, for
polysemy, and false negatives, for synonymy. And this goes beyond
simply, information retrieval and text retrieval into
any generic problem where you have possibly
a large set of features that have
this problem with false negatives and false positives.
Words Like Tesla
So, what we're going to get out of feature transformation is
the ability to combine these features, these words, somehow, into
some new space where hopefully we will do a better
job of eliminating our false positives and false negatives by combining
words together. So, we're going to leave this example for now,
but just to give you an intuition about how a proper
kind of feature transformation might help you just let me
point out that if I type in a word like car,
you would expect, since I know what car kind of means, let's say automobiles in
this case, I should pick up documents or
score documents higher if they don't contain the
word car, but they do contain the word
automobile. Or perhaps they contain a word like
motor or race track. Or Tesla, right, that
somehow words like Tesla, and automobile, and motorway,
and anything else you can think of that has
to do with cars probably are highly correlated or highly
related to cars in some way. And so, it would
make sense to combine words like automobile and car together
into new features to pick up documents that are
somehow related together that way. Does that make sense? So
that's the intuition I want you to think of for
the three algorithms that we're going to go over next. Okay?
- Yeah, I could definitely see how synonymy is
going to be a win, when we index the documents,
if we include, well, if we map them down to a
lower dimensional feature space where one feature's used for anything sort of
car related. I'm not sure how that's going to help with polysemy, but
I definitely see how this feature transformation idea could be a win
with synonymy.
- So that's right, and I think it's
much easier to see how it works with synonymy than with polysemy.
The way it will, it could in principle help you with polysemy,
is that it will combine a bunch of features that together eliminate
the ambiguity of any particular word. So as you type in
more words together it'll start to pick those things up while also
eliminating synonymy. So we'll see. The way it's going to work in practice
actually is that, if you think of it, now, we are talking
about unsupervised learning, but if you think of this as a
supervised learning problem, then you can imagine how you can ask yourself
how to combine sets of these features, these words, together such that
they can still give you correct labels. And if you can solve
that problem you will end up solving both
polysemy and synonymy, or at least minimizing their impact.
- OK. So let's go from this specific example to
far more abstract examples and talk about three specific algorithms.
Principal Components Analysis
Okay, so the first linear transformation algorithm we're going to
discuss, is something called Principal Components Analysis. Okay Michael?
- Now just to be clear here, the amount of time it would take me
to derive Principal Components Analysis and work it
all out in its gory detail would take
forever and so I'm not going to do that,
I'm going to leave that to reading. But I
want people to understand what Principal Components Analysis
actually does and what its properties are, okay?
So, Principal Components Analysis has a long history.
It's a particular example of something called an eigenproblem.
- Okay. It's a particular example of something called an
eigenproblem which you either already know what that means
or you don't. And if you don't then it
means you haven't read the material that we've given
you so I'm going to ask you to do that. But, whether you have or have not let
me just kind of give you an idea of what
Principal Components Analysis actually does. And I think the
easiest way to do that is with a very simple two dimensional example. So
here's my very simple two dimensional example.
All right, so you see this picture Michael?
- So this is a bunch of dots sampled from some distribution
that happens to lie on a two dimensional plane like this, okay?
- Now, what, so this is in fact two dimensions. So we have two
features here, we'll just call them x and y. We could have called them
one and two, it doesn't really matter, this is just
the x, y plane. And let me tell you what Principal
Components Analysis actually does. What Principal Components Analysis does, is
it finds directions of maximal variance. Okay, does that make sense?
- Variance of what?
- The variance of the data. So if I had to pick a single direction
here, such that if I projected it onto
that dimension, onto that direction, onto that vector,
and then I computed the variance, like, literally the variance of
the points that were projected on there. Which direction would be maximal?
- I would think it would be the one, that is
sort of diagonal. It kind of, blobs along that particular direction.
- Right, that's exactly right. And, and to see
that, imagine that we projected all of these points onto
just the x dimension. That's the same thing as
just taking that particular feature. Well, if we projected all
of them down, we would end up with all of
our points living in this space. And when we compute
the variance, the variance is going to be something that captures
the distribution between here and here. Does that make sense?
- Similarly, if we projected it onto the y axis, which is the equivalent of...
- Those are examples of feature selection.
- Yes, exactly, it's feature selection. It's the equivalent of just looking at the second feature
here, y. I'm going to end up having a
variance that spans this space between here and here. By
contrast, if we do what you want to do,
Michael and we pick a direction that is about 45
degrees if I drew this right. We would end
up projecting points between here and here. Now, it's not
as easy to see in this particular case, but the
points as they get projected onto this line
will have a much higher variance than on
either of these two dimensions. And in particular it
turns out that for data like this, which I've
drawn as an oval that, you know, sort of
has an axis at 45 degrees, this direction or axis is in fact the one that
maximizes variance. So, Principal Components Analysis, if it had
to find a single dimension would pick this dimension.
Because it is the one that maximizes variance. Okay, you got it?
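(If it helps to see that in code, here is a small sketch with made-up data shaped roughly like the oval being described; it computes the direction of maximal variance from the covariance matrix, which is one standard way of finding the principal components.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D blob, stretched and then rotated to lie along the 45-degree diagonal.
raw = rng.normal(size=(500, 2)) * [3.0, 0.5]
theta = np.pi / 4
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
X = raw @ rotate.T

X_centered = X - X.mean(axis=0)              # PCA works on mean-centered data
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues come back in ascending order
first_pc = eigvecs[:, -1]                    # direction of maximal variance
print(first_pc)                              # roughly [0.707, 0.707], the diagonal

# Variance of the data projected onto x alone versus onto the principal component.
print(np.var(X_centered @ np.array([1.0, 0.0])))
print(np.var(X_centered @ first_pc))         # noticeably larger
```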
- Okay. Now, what's the second thing, what's the second component
that PCA or Principal Components Analysis would find. Do you know?
- I don't know what you mean by second. This
is now a direction that has high variance.
- The first one.
- Because it seems like, you know, either the x or the y is
pretty high, or something that looks just like that red line but it's
just a little bit tilted from it would also be very, very high.
- Right, that's exactly right so, in fact
what Principal Components Analysis does is it has
one other constraint that I haven't told you
about. And that constraint is, it finds directions that
are mutually orthogonal. So in fact, since there
are only two dimensions here, the very next thing
that Principal Components Analysis would do, is it
would find a direction that is orthogonal, or, as we
think about it in two dimensions,
perpendicular, to the first component that it found.
- I see. So there really only, really only is one choice at that point.
- That's right. Or, you know, there are two choices,
because it doesn't matter which direction you pick, in principle.
Principal Components Analysis Two
So this is called the first component, or the principal component, and this is
called the second, or second principal component
of this space. Okay, does that make sense?
- Now, here's what's interesting about principal
components analysis. You might ask me exactly
how you do this. There are several mechanisms for doing it. For those
of you who have dealt with linear
algebra before, something like the singular value
decomposition might be familiar to you. It's
one way of finding the principal components.
But principal components analysis basically has a lot of really
neat properties, so let me just describe some of those
properties to you. The first property is, well, the two
that I've written here: it finds directions that maximize variance, and
it finds directions that are mutually orthogonal. Mutually orthogonal means
it's a global algorithm. And by global here I mean
that all the directions, all the new features that it
finds, have a big global constraint, namely that they must be
mutually orthogonal. It also turns out that you can prove, which
I will try to give you a little bit of evidence
for but am not going to prove formally, that PCA actually gives
you the best reconstruction. Now what do I mean by best
reconstruction? What I mean is, if you think of each of
these directions that it found, in this case, it found this
one first and this one second. The first thing,
I hope you see, is that if I return these two
dimensions, I have actually lost no information. That is, this
is just a linear rotation of the original dimensions. So if
I were to give you back these two different features,
you could reconstruct all of your original data. You see that?
- Wait. When you say features you mean, what we're
going to give it is, for each of those little black
data points, we're going to say how far along the red
axis it is and how far along the orange axis?
- So it really is just a kind
of a relabeling of the dimensions. Right, so just
like when I think about these points in x and
y space, the original feature space, whenever I give you value
here for this feature I'm just describing how far along
a black dot is on this dimension or this projection. Now
the second dimension or the second feature just tells me how
far along a dot is on this particular dimension or axis,
and similarly, by projecting on the red and on
the orange, I'm telling you how far along a point is
along this axis and along that axis. So if I were to return, if I were to take X
and Y and transform them into this new one
and two, I would have given you different values than
I did from X or Y, but they're actually just
the same point. And so I've thrown away no information.
- So that's a pretty good reconstruction.
- That's a pretty good
reconstruction. But what principal components analysis
does for you is, if I take just one of these dimensions,
in particular the first one, the principal component, I am
guaranteed that if I project only into this space
and then try to re-project into the original space, I
will minimize what's called L2 error. So do you understand that?
- I'm not sure, let me check. So, you're saying, all right, we take the little black dots, they now have a red dimension and an
orange dimension, one and two, and now if we reconstruct using only the first
dimension, I guess it just puts the black dots on the red line.
- Yep. And so there is no, they have no existence in that second dimension.
- And now you're saying, of all the different ways
that I could do that, to kind of project it to
a, to a linear sequence, this is the one that's
going to have the smallest L2 error, which, if I'm not mistaken,
is the same kind of reconstruction error we talked about
all the other times we talked about reconstruction. So it's
like squared error.
- That's exactly right, it's squared error. This
particular notion is called an L2 norm, but that's really just
talking about distance. What this means is that if I
project onto this single axis here, and then I compare it
to where it was in the original space, the distance,
the sum of all the distances between those points, will actually
be the minimal that I could get for any other projection.
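(Here is a small numpy sketch of that claim, again with made-up data: project onto the first principal component only, re-project back into the original space, and compare the summed squared error to what you get from projecting onto some other single direction, such as keeping only the original x feature.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 2.0],
                                          [0.0, 0.5]])   # made-up correlated data
mean = X.mean(axis=0)
Xc = X - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
first_pc = eigvecs[:, -1:]                     # 2 x 1 projection onto the "red" axis

coords = Xc @ first_pc                         # position along that axis only
reconstructed = coords @ first_pc.T + mean     # points dropped back onto the line
pca_error = np.sum((X - reconstructed) ** 2)   # summed squared (L2) error

other = np.array([[1.0], [0.0]])               # keep only the original x feature instead
other_error = np.sum((X - ((Xc @ other) @ other.T + mean)) ** 2)
print(pca_error, other_error)                  # the PCA error is the smaller of the two
```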
- And you can, you can sort of prove this. It kind of makes sense if you
just think about the fact that points always
start out in some orthogonal space, and I'm basically
finding a scaling and a rotation such that
I don't lose any information, and I maximize variance
along the way. By maximizing variance, it turns out
I'm maximizing or maintaining distances as best I can
in any given dimension. And, so, that gives me the
best reconstruction that I can imagine. Now, you might ask
yourself, Is there anything else nice about principal components analysis
given this reconstruction error? And there's another property of PCA that
is very, very useful. And it boils down to the
fact that it's an eigenproblem. What happens when you do
principal components analysis is you get all of these axes
back, and in fact, if you start out with N dimensions,
you get back N dimensions again, and the job
here for a feature transformation, as you might recall, is
you want to pick a subset M of them, hopefully
much smaller than N. Well, it turns out that associated
with each one of these new dimensions that we
get is its eigenvalue. That eigenvalue is guaranteed to be
non-negative, it has a lot of other neat properties.
But what matters to us here is that the eigenvalues
monotonically non-increase, that is, they tend to get smaller
as you move from the principal to the second
principal, to the third, to the fourth, to the
fifth, to the sixth, and so on to the nth
dimension. And so, you can throw away the ones
with the least eigenvalue. And that's a way of
saying that you're throwing away the projections, or the
directions, or the features, with the least amount of variance.
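(A sketch of that last step using scikit-learn's PCA, assuming scikit-learn is available; the data is made up. The point is just that the eigenvalues come back in non-increasing order, so keeping the first M components and dropping the rest is a form of feature selection in the transformed space.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # five correlated features

pca = PCA().fit(X)                  # keep all N components for now
print(pca.explained_variance_)      # eigenvalues, sorted largest to smallest

# Keep only enough components to explain, say, 95% of the variance.
keep = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
X_reduced = PCA(n_components=keep).fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```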
Principal Components Analysis Three
Okay, wait. So let me see if I can echo some of that back. So, it's almost as
if what we're doing here is we're doing a
transformation into a new space where feature selection can work.
- Exactly, and in fact, here's something kind of interesting for you.
It turns out that if the eigenvalue of some particular dimension
is equal to zero, then it means it provides no information whatsoever
in the original space. So, if I have a direction, if I have
a dimension that has an eigenvalue of zero, I can
throw it away and it will not affect my reconstruction error.
- I'm trying to remember if that makes it irrelevant or useless.
- Well, it certainly makes it irrelevant. If there's no variance, that's the
same thing as saying it has zero entropy, because it never changes. So it's
irrelevant. Now whether it's useful or not, well, probably not, but it might
be useful for something like, our simple,
[INAUDIBLE] example that we used last time.
- Okay. So does this all make sense?
- So one question I have is, in the example
that we just kind of worked through, there was a blob
of data, and when we drew the red line through
the maximum variance direction, it went through the x-y origin.
Does it have to? Is it necessarily the case that it's going to?
Or, you know, is the algorithm restricted
to have to put things through the origin?
- Well, so my
answer to you is that that's actually a very complicated
question. And in principle, so to speak, it
doesn't really [LAUGH] matter. But in practice, what people do
when they're doing something like PCA is they actually subtract the
mean of the data, or the centroid of the data,
from all the data points. So it ends up being
centered around the origin. But what this
means is that you can then interpret what we're doing,
finding maximum variance, as capturing correlation.
- Okay. That's helpful.
- And otherwise, if you don't do that, what you end up with is
effectively a principal component that at least
intuitively kind of captures the notion of where
the origin should be, and it's not terribly helpful for what it is we're
trying to do. So my answer is, no it doesn't, and yes it does.
- Okay. So, that's, that's basically
PCA. It's got all these neat properties.
Let's sort of summarize them again. Well, actually,
let me add a couple beyond these. It is a
global algorithm. It does give you best reconstruction error, which
seems like a very nice thing to have.
And it tells you which of the new features
that you get out are actually important with respect to
this notion of reconstruction, by simply looking at their corresponding
eigenvalues. It also has a couple of other practical properties.
In particular, it's very well studied. And what that means
in this case is that there are very fast algorithms that
do a very good job, even in weird cases. Even with
large data sets, at least if they're appropriately sparse, of being
able to compute these things. People have been working on this
problem again, since before we were born Michael, and they've gotten
really good at finding principal components even for what would be
very difficult spaces. But this does lead me to
another question, which is, okay, so
you've got an algorithm that provably gives you the best
reconstruction. But what does it have to do with classification?
This is kind of like the question we had before
about relevance versus usefulness. So we find a bunch
of projections which are relevant in the sense that they
allow you to do reconstruction. But it's not clear that,
if you threw away some of these projections, the ones
with low eigenvalue, even though you'd be
able to reconstruct your original data, it's not
clear that would help you do classification later.
Can you see how that would work out?
- Sure, I mean like, it just could be that that's
not where the information is about what the labels ought to be.
- Right, so imagine for example that one
of your original dimensions is in fact directly related
to the label and all the rest are
just, say goush and noise. But the, the variance
of that particular direction is extremely small.
- It might end up throwing it away.
- It'll almost certainly end up throwing it away. Which means
you'll end up with a bunch of random data that doesn't
actually later help you do classification. And that's because this looks
a lot like, if I can do the analogy, a filter method.
- I was going to say that, yeah.
- Oh. So this is kind of like filtering. And in fact it's going
to turn out that the other two items we're looking at too are like
filtering although their particular criterion might be
more relevant. We'll see. Okay? Alright, so do
you understand principal components analysis? Again, I'm not
asking you to understand exactly how you would
run a principal components analysis algorithm. That
stuff's in all the texts, but just
to sort of understand exactly what it is
trying to do, what it's trying to accomplish.
- Yeah, I think so. So, it's taking the data, it's finding
a different set of axes, that are just like regular axes in that
they're mutually orthogonal. But it lines up the variance of the data with those
axes, so that we can drop the least significant ones, and that gives us a
way to do feature selection. But the
whole thing is a feature transformation algorithm, in
the sense that it first moved the data around, to be able to do that.
- That's exactly right. So, you do, transformation into
a new space, where you know how to do filtering.
- Got it.
Independent Components Analysis
Okay Michael, so the second algorithm that we're going to
look at is just like principal components analysis, except it's
called independent components analysis. Okay, and the major difference is
really the difference between the first word, principal versus independent.
So the main idea here is that PCA is about
finding correlation. And the way it does that is by
maximizing variance. And what that gives you, is the ability
to do reconstruction. What independent components analysis is doing, or
ICA as it's often called by those in the know, is trying to maximize independence.
Very simply put, it tries to find
a linear transformation of your feature space into
a new feature space, such that each of the individual new features is mutually
independent, and I mean that in a
statistical sense. So, you are converting your Xi's,
your X1, X2, and so on, into some new feature space, let's call it, I don't
know, let's call it Y: Y1, Y2, and so on, such that each
one of the new features is statistically independent of the others, that is
to say, their mutual information is equal to zero. Does that make sense?
- This is going to be a linear transformation?
- It's going to be a linear transformation.
- To make them statistically independent.
- So, I find some linear transformation here, which is going to take my
original feature space, which I'm representing with
these X's, and transform it into a
new feature space, such that, if I were to treat each of these new
features as random variables and compute their
mutual information, I would get for all
pairs a mutual information of zero. That's part one and
the second thing that it's trying to do is that
it's trying to make certain that the mutual information between
all of the features, y and the original feature space x
is as high as possible. So in other words, we
want to be able to reconstruct the data. We want
to be able to predict an X from a Y and a Y from an X, while at the same time
making sure that each of the new dimensions is in
fact mutually independent, in a statistical sense.
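(Stated compactly, in notation that is mine rather than the lecture's: ICA looks for a linear transformation W giving y = Wx such that the pairwise mutual information I(y_i; y_j) is zero for every i ≠ j, while the mutual information I(Y; X) between the new features and the original ones is as large as possible, so the new features are mutually independent but jointly retain the information in the observables.)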
- I think I'm going to need an example.
Independent Components Analysis Two
So it is important to get an example. I think
that the best way to kind of see an example
here is to draw a little picture for you that tells
you what the sort of underlying assumption behind ICA is. So here's
the kind of, or at least the way I see what
the fundamental assumption of ICA is. And that is this. You're
assuming that there's a bunch of things in the world,
and these are going to be hidden variables. And they've got some properties
that are associated with them. They're random variables. They
are mutually independent of one another. That is, if I
know the value of one it doesn't tell me anything
about the value of the other. But we don't know what
they are. And what we get to see are observables. These observables are given rise
to by the values of the hidden variables, and they somehow combine in
a sort of linear fashion, and our job, the job of any learner
in this case, any unsupervised learner in
this case, is, given your observables, to try to
find the hidden variables, under the assumption that
these hidden variables are in fact independent of
one another. So, what's a concrete example of
that? I'm going to give you a concrete example
of that, and then I'm going to give you a demo that we found on the web. Okay?
- So, one problem where we know this is true is
something called the blind source separation problem. So what's the Blind
Source Separation problem? It also has another name, which is
the Cocktail Party Problem. Do you remember the Cocktail Party Problem?
- I think so, yeah.
- So, describe it to me.
- So, not that I go to a lot of cocktail parties, but if you're
in a, in a large group of people like in a cafeteria or something like that.
- And you're trying to listen to a conversation. It can
be very challenging because there's
all these different conversations happening simultaneously.
- And we need
to be able to pull out that one source
that, that we're, that we're caring about, so that
we can listen to it while kind of separating
it out from what all the other noise sources are.
- So what you just described to me is
that somehow you have a bunch of people, so
there's Charles, there's Michael and there's Pushkar.
And they're all talking at once. And let's imagine
in place of ears we have microphones. And these
microphones are actually recording everything that, well, everything that
they hear. So this means that this first microphone
is going to hear me. The second microphone is also going to
hear me and so does the third microphone. However,
they're going to each hear me at a slightly different
volume and at a slightly different delay. Okay? Because they're
placed sort of randomly around the room. You with me?
- I think so, yeah.
Similarly, Michael, who just keeps talking and talking and
talking, mostly about puns, is also going to be picked up
by each of the three microphones, and Pushkar, mainly complaining
that we aren't doing enough quizzes, is also going to
be heard by the three microphones. In this case, if
we look at our other examples, the people are the
hidden variables. That is, they're the things that are causing
events to happen in the world. But what we have,
the observables, are the microphones, and what it is that
they're recording. Now, if we're in a small enough room, it
turns out that physics tells us that, even though our voices
are all nonlinear and sound doesn't travel in a very linear
fashion the way that we would like, it turns out
that in a small enough room, with all of the reflections
and everything, each of these microphones is well modeled as receiving
an unknown linear combination of all of the sources. So this
means that each one of these microphones actually contains a
different linear combination of each of our voices. Hey, you with me?
- Yeah, it's interesting.
- Right. And so, really, if I were to give you these three recordings
and I wanted to figure out what, say, Michael Littman was saying, all that
information is here, but I can't extract
Michael Littman from microphone one, or microphone
two, or microphone three. But given all
of these microphones, I can in fact recover
just Michael alone, or just Charles alone, or just
Pushkar alone, because ICA exactly tries to find this
notion of mutual independence without losing any information from
the output. And this model here, of three people
talking, mostly independently, and being recorded as linear combinations
by different microphones, is exactly the model that ICA
was designed to represent. So, just to sort of
drive this home, let me give you an actual example
that we found on the web that does exactly this problem. Okay?
Cocktail Party Problem
Okay, Michael, so I pulled up this webpage. We
will provide a link to it, on our own webpages.
And you'll notice there's a really nice copyright here, and
this is all free to use. So I'm not, I'm
not doing anything wrong here. They're going to use Independent Components
Analysis, as a way of recovering, original sounds. So, here
you see that I've clicked on a police car, somebody
talking in a commercial, and, let's say, this dude here.
And, what it's going to do is it's going to, generate
sounds from each of these three sources that are all
independently generated. And, it's going to mix them. So if I
click on these icons here, of these microphones. See, these
look just like what I drew before. You'll be
able to hear each of their sounds combined together. So
I'm going to put my microphone very close to it so
you can hear it. [SOUND] [FOREIGN] And here is another
one of the microphones. [SOUND]
Okay, so, Michael, could you hear that?
- Okay. So, what I did is, I played two of the
mixed sounds together. And, you know, to
the untrained ear, they sound very much alike.
But I think the most important thing is, all three of these sounds are
on top of one another, and as human beings we actually do a pretty
good job of being able to focus on one of them,
because, I don't know, we're designed to do that. But machines have
historically had a terrible time doing this. And independent components
analysis was the first sort of generic algorithm that was able to
solve this. So, we're going to use ICA now to separate the
sound sources. And as you see, we get these little gramophone things.
And I'm going to play, each one of these and see if
they did a good job of separating those into the original sources.
[FOREIGN] The guardians
of the electronic stock market NASDAQ
who have been burned by past ethics questions are. [SOUND] Okay,
so wasn't that kind of cool
Michael? I mean we, we took these completely mixed up sounds and
we were able to recover the original, by using independent components analysis.
- That's fascinating. I don't still
understand what it's doing, but its pretty
neat that it could do that, because it sounded pretty mixed up to me.
- Well it was, it was in fact each
one of these three sources, that we eventually heard
were in fact arbitrarily, linearly, combined into each of
these separate microphones. And there's really sort of no way
in which you'd be able to recover the original sounds, because, there's
no reason for you to be able to do that. But people
do it all the time. And now, with something like independent components
analysis, you can also do it. And the reason it works with
independent components analysis, the fundamental
assumption that it's making, is that
whatever is generating, in these cases,
these sounds, the sources, are statistically
independent of one another. Which happens to be true in that case.
And they're combined together in a way that is a linear combination.
Now there's a lot of details here. And again
when you read the materials you see exactly how independent
components analysis works. But, intuitively all it's doing is taking
this particular model, and by using mutual information directly to
find things that are independent of one another, while
not losing any information, from the observables, it's able to
reconstruct these sounds. And it's able to do this incredibly
quickly, and incredibly well. Under a large number of conditions.
Okay Michael. So let's try to be a little bit
more detailed about, kind of, the mechanics of how you, how
you would make this work. The algorithms for finding Independent Components
Analysis, are in all the reading. But, I do think it's
pretty easy to kind of get lost. If you don't, sort
of, turn these matrices into actual numbers. So, let me just,
sort of, give you a quick example of how we would
do this specific thing here. And see if that helps, okay?
- Okay. So, how would you go about turning
this into a problem that something like, you
know, an actual algorithm that uses numbers could
do. Well, it turns out it's fairly straightforward.
And let's just take this, this particular example
here, with people talking into a microphone. Basically
you create a matrix, which contains samples of
all of the sounds. So, in our original
space, we have, you know, this handsome person here.
We have this other, let's say, similarly handsome but in a
different way person here, and we have this other person, with
my poor attempt to draw hair over here, and they're talking.
Well, what we end up doing is we basically take the sound
wave and we sample it. And what does it mean to
sample it? Well, you know, you represent sound in a computer as
a sequence of numbers, just like you represent pictures as a
sequence of numbers, and in fact, you represent words as a sequence
of numbers. And so we basically turn these sound waves
into a matrix full of values. So, as I described
before, Michael, each one of these microphones. Is actually getting
a linear combination of speech from each of these three
individuals. So if we think of microphone one, it is
seeing some kind of sound wave, which is again a
linear combination from each of these people generating a sound
wave. The same is true for microphone two, and similarly
for microphone three. Now, of course, we're talking about recording
these and we're thinking about computer programs, which means we take
this continuous sound wave and we sample it at a very
fast rate. And what we end up with is a sequence
of numbers that represent the particular pressure that we're getting
in these sound waves. So, these sound waves get converted into
a sequence of numbers and I'm just going to make up some
numbers for this. And now what we have is a matrix
where each row represents a feature in our original feature space.
And by original here, I mean, the feature space that we
see each of our microphones, and every column represents a sample, say
at time 0, time 1, time 2, and so on. Does that make sense?
- Yes.
- And so, since each of these is a linear combination
of the voices we get here, our goal is to find a projection,
like we did before in the original definition of the
problem, such that if we projected this onto here, we
would end up with a new feature space that corresponds individually
to each of these people. And you could do this with
anything, it doesn't have to be sounds, but if we think
about the ad hoc query problem that we went through earlier, each
of these rows would be words, say the presence of words
would be your features, and each of these columns would be
documents, say. And the goal would be to find
a projection, given a set of documents, that allowed
us to recover some underlying structure that was useful
for classification, or for information retrieval. Does that make sense?
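(If you want to see the mechanics without the audio demo, here is a sketch in the spirit of that matrix picture, using scikit-learn's FastICA on synthetic signals; the three sources and the mixing matrix are made up, and note that scikit-learn's convention puts time samples in rows and microphones in columns, the transpose of the matrix drawn above.)

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t),                    # "speaker" 1
                np.sign(np.sin(3 * t)),           # "speaker" 2
                rng.laplace(size=t.size)]         # "speaker" 3, non-Gaussian

mixing = rng.normal(size=(3, 3))                  # unknown room acoustics
microphones = sources @ mixing.T                  # each column is one microphone

ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(microphones)        # estimated sources
# The recovered columns match the originals up to ordering and scale.
print(recovered.shape)
```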
- Yeah. And can you say, what does
it mean for that sequence to have mutual information?
- So, as you recall, because I know that you listened to Pushkar's
discussion about mutual information and entropy and so
forth, all mutual information is is a measure of
how much one variable tells you about another variable.
And the assumption here is that each of these people
talking has no mutual information. That is, they're talking
independently of one another, or the sounds that are coming
out of their mouths don't allow us to predict
the sounds that are going to come out of another
person's mouth. So it may be that these people are
talking to one another, as we often do, Charles might
be talking to Michael, who's talking to Pushkar, who's
talking back to Michael, who's talking to Charles, but the exact
sound waves that we see are actually independent of one
another. So, if that turns out to be true, what independent
components analysis is trying to do is trying to recover
features, in this case, it turns out, corresponding to the individual speakers,
such that their sound waves, or the values
that you see, are statistically independent of one another.
Okay? And that's what this does. But, since
we can always come up with arbitrary projections that
are statistically independent, it's important that whatever our
new transformation gives us, it actually has some relationship
with the original values that we saw. So somehow
we want to be able to go from this
into this in a way that we don't lose any
information, in much the same way that PCA was trying to
minimize the loss of information. And so we not only want
each of the individual new features, in this case people, to
be statistically independent, we want the amount of data that we
get from these people, compared to the amount of data that
we got originally from the microphones, to actually strongly predict one
another. And if the mutual information between the microphones and our
candidates for the people's voices is high, then
it means that we haven't lost anything. And so those
two things together, those two constraints, force us into a situation
where, if this model is true, we reconstruct the original voices.
PCA vs ICA
- Okay Michael, so I'm going to see if you understood that fire hose
of words that I threw at you by giving you a quick quiz.
- Yeah, you're welcome. So, what I have down in the middle here is
a bunch of phrases that might describe
PCA, might describe ICA, might describe both
of them or might describe neither. Okay? And what I want you to do
is check off under PCA each one of these that applies. And check off
under ICA each one of these that applies. Okay. Does that make sense?
- Okay. Go.
PCA vs ICA
Okay Michael you got answers for me?
- Think so.
- Okay good. Alright, so let's look at the first one.
Mutually orthogonal. Does that apply to PCA, ICA, both, or neither?
- So, it was one of the defining properties of PCA, so I would say PCA.
- Okay, that's fair enough. That is actually correct. What about ICA?
- I don't, I don't know. It's not one of
the defining features. I wasn't even thinking about orthogonality.
- That's right. And in fact ICA, by finding independent
projections, is all but guaranteed to find ones that are not mutually
orthogonal; it doesn't care at all. So in fact this is what makes PCA
a global algorithm, since it has
this global constraint of mutual orthogonality, whereas
ICA has nothing in its definition that cares about that. So this should be unchecked.
- I'm going to put an X there to represent unchecked. Okay, got it?
- Okay. What about mutually independent?
- That was how ICA was trying to construct its, I don't
want to say features, yeah, I guess, the transformed features.
- That's right.
- It was trying to create them to
be mutually independent, so I would check ICA in
that case. And, PCA didn't use that language
at all, so I would just not do that.
- Okay. That's fair. I will point out though,
that it turns out that PCA is trying to do
something that sometimes will create things that are mutually independent.
But we'll see that when we answer the next question.
But you're right, PCA does not care about mutual
independence. Okay, what about the third phrase, maximal variance?
- Alright. Again, I feel like this was one of the defining
features of PCA. So it was trying to choose dimensions that maximize variance.
- That's right, and what about ICA?
- In terms of variance? Again, the, the,
there, it wasn't really discussed in that context.
- Right. ICA is specifically interested
in notions of independence, statistical independence, not
in issues of variance. So in fact
ICA does not care about maximizing variance. Now
that we have gotten this far, let me point
something out. It turns out that because of
this orthogonality constraint, this lack of an independence constraint,
but this goal of maximizing variance, there are
cases under which PCA happens to find independent projections.
What's really going on here with these three constraints,
or, or lack of at least one in this
case, is that PCA is tending to find things
that are uncorrelated. By maximizing variance along orthogonal dimensions, it is
finding uncorrelated dimensions. And that makes some sense given
what it's trying to do. But that is not the
same thing as finding things that are statistically independent.
Again, there is this particular case, there is a specific
case where that does work out, and that's
when all of your data is in fact Gaussian.
- Oh, interesting.
- And why Gaussian? Because it
turns out that the distribution that maximizes
variance is in fact the normal
distribution. That's my interpretation of a normal
distribution. So maximizing variance means that what PCA is doing is it's trying
to find a bunch of orthogonal Gaussians. More or less. Does that make sense?
- But, you were saying that it aligns with ICA in that case?
- In that case, yes, because it turns out that uncorrelated ends up being independent under very specific distributions. But that's a coincidence; it's not a normal fact.
- Ha, ha.
- But by the way, this is probably worth pointing out, something that at least I think is kind of interesting. Since ICA is only trying to find things that are independent of one another, here's our little symbol for independence, as opposed to things that are uncorrelated, things that are [INAUDIBLE], it turns out that whatever it is PCA is doing, it is not working under the same underlying model as ICA. Let's think about what ICA is trying to do. ICA is trying to find a bunch of these projections, all of which are statistically independent. Right?
- What happens if I take a whole bunch of statistically independent variables and I add them together? In other words, I create a linear combination of them. What am I going to get?
- A bunch of sums of things that are independent?
- Right, and what does that tend towards in the limit?
- I want to say that the law of large numbers tells us that it turns normal.
- That's exactly right. If I take a bunch of independent variables and I sum them together, that is, I create a linear combination, I in fact end up with a Gaussian. And that is the central limit theorem. So one argument you might make is that if you believe in a world where there are independent causes giving rise to observables, which is what ICA believes, then whatever you do, you should not be maximizing variance, because you're guaranteeing that you're summing together otherwise independent variables.
- Oh, I see. I see, I see. So by trying to find things of maximal variance, it's trying to mix together, through the central limit theorem, all these things that are independent. So it's specifically not teasing apart the independent things; it's trying to smush the independent things together.
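(A minimal sketch of that central-limit-theorem effect, assuming NumPy is available: sum a bunch of independent, clearly non-Gaussian variables and the result looks very Gaussian, which is exactly the kind of direction a variance-maximizing method is happy to find.)

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples of 50 independent, uniform (very non-Gaussian) sources.
sources = rng.uniform(-1.0, 1.0, size=(10_000, 50))

# A linear combination of independent sources -- the kind of direction a
# variance-maximizing method is happy to produce.
mixed = sources.sum(axis=1)

# The mixture is close to Gaussian: skewness ~ 0 and excess kurtosis ~ 0.
standardized = (mixed - mixed.mean()) / mixed.std()
print("skewness:", np.mean(standardized ** 3))
print("excess kurtosis:", np.mean(standardized ** 4) - 3.0)
```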
- Right, under certain assumptions about the distributions of these individual variables. So another assumption that we're making is not just that these variables are independent, but that they are highly non-normally distributed. And if that's the case, then ending up with things that look like Gaussians has got to be exactly the wrong thing to do, if that assumption holds true. Okay. What about maximum mutual information?
- My understanding of what you described for ICA is that this is what it's trying to do. It's trying to find a new feature space where the mutual information between the features is as large as possible. So I would check ICA in that case and not PCA.
- Let me clarify something you said. How can it be trying to maximize mutual information while also trying to make things that are mutually independent? I think you would describe them both in the same language. So what is it trying to make mutually independent?
- Different features. The new transformed features.
- Right. So each of the new transformed features is independent of all the other new transformed features. So what is it trying to maximize mutual information between?
- Oh, I see. The information content of the original features and the new features.
- Right. So this is about joint mutual information between all the original features together and all of the transformed features. There it's trying to maximize mutual information. But inside the new transformed features, it's trying to make them pairwise mutually independent.
- Alright. I think I said that wrong because I understood
it wrong. So thank you for clarifying.
- You're welcome. But now you understand it right.
- Okay, let's go with that because
you got the checkmark right. What about PCA?
- Just X that. I don't understand what that would mean.
- Right. So if I were to put something here that it could get a check mark for, what I would put down is maximal reconstruction.
- Right. And notice that maximal reconstruction of your original data is not the same thing as maximizing mutual information, though of course in the limit they work out to be the same.
- Interesting, okay.
- But the projection that maximizes variance is not necessarily the same as the first projection you'd find by maximizing mutual information. So these things really are doing two completely different things. Okay, last two: what about ordered features?
- So, in PCA, it was actually taking the maximum-variance dimension first, and then the next: after that's been subtracted out, whatever has the largest remaining variance, and so forth. So the features end up coming out in a very specific order. And it has the property that you could drop the last group of features and still have as good a reconstruction as possible, given the number of features that you keep. So I would check PCA for that.
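(A minimal sketch of that ordering property, assuming scikit-learn is available: the components come back sorted by explained variance, and truncating to the first few gives you the best reconstruction you can get for that many components.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated data

# Components are ordered from largest to smallest explained variance.
pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_)

# Keep only the first 3 components and reconstruct the data from them.
pca3 = PCA(n_components=3).fit(X)
X_hat = pca3.inverse_transform(pca3.transform(X))
print("reconstruction error:", np.mean((X - X_hat) ** 2))
```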
- Okay, good. What about ICA?
- You didn't say anything about the ordering or how you'd actually find the features. It seemed like in the blind source separation example you gave, it just came out with the three, so I'm going to say not ordered.
- That's right. And in fact, if you think about the blind source separation example, how in the world would you order people anyway? I mean, other than in the obvious way. It just doesn't really mean anything. ICA doesn't have a notion of some causes being more important than other causes, merely that they're independent. So it doesn't really worry about ordered features. It turns out that in practice you can actually try to order the features by using something called kurtosis, which is the standardized fourth central moment of a distribution. But that's really just something that's useful in some specific cases, almost by coincidence. ICA itself does not particularly care about ordering. At least not classical ICA.
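(A minimal sketch of that kurtosis heuristic, assuming scikit-learn's FastICA and SciPy are available: recover components from a toy mixture, then rank them by how strongly non-Gaussian they are.)

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(2000, 3))          # independent, non-Gaussian sources
X = S @ rng.normal(size=(3, 3))          # observed linear mixtures

components = FastICA(n_components=3, random_state=0).fit_transform(X)

# ICA returns the components in no particular order; rank them by kurtosis.
k = kurtosis(components, axis=0)         # excess kurtosis per component
order = np.argsort(-np.abs(k))
print("kurtosis:", k, "suggested order:", order)
```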
Okay, what about the last one, bag of features?
- So, I would view what you just said about ICA as implying that what ICA produces is a bag of features. There's no particular ordering to them.
- That's right.
- It's just a collection of things that make up the whole. I guess, you know, PCA, after you've thrown away whichever features you don't want, the features that remain are just features. They could be treated as a bag, I guess. So I don't know if I would check that or not. For symmetry, I guess I would say not.
- Okay, but I'm going to say yes, because an ordered set of features is still a bag of features. But we would accept either, either a check or an X; both are sort of fine. So then, what do we really learn from this, Michael? I think what we've learned is that these things have fundamentally different assumptions and are really trying to do completely different things. The only thing they have in common is that they're still trying to capture the original data somehow.
- Alright. I understand that, but I also learned the opposite, which is
that they are really closely related and are trying to do very similar things.
- Yeah. But their underlying models are different. So maybe that's actually a good point, Michael. Maybe a better way of saying it is, their fundamental assumptions are different, even though they're trying to do the same thing, which is capture the original data in some new transformed space that is somehow better. But if you think about it that way, these are two different optimization functions, two different fitness functions, two different cost functions. So even though they're trying to do the same thing, reconstructing, keeping the data around, their basic assumptions about the way that data is constructed are very different.
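(To make those differing assumptions concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available, with FastICA standing in as one common ICA implementation: mix two independent, non-Gaussian signals, then ask PCA and ICA to recover them. PCA returns orthogonal, variance-ordered directions; ICA comes much closer to the original independent sources.)

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources: a sine wave and a square wave.
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # mixing matrix
X = S @ A.T                                         # observed mixtures

S_pca = PCA(n_components=2).fit_transform(X)        # orthogonal, max variance
S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

# Correlate each estimate with the true sources (up to sign and permutation).
for name, est in [("PCA", S_pca), ("ICA", S_ica)]:
    corr = np.corrcoef(S.T, est.T)[:2, 2:]
    print(name, np.round(np.abs(corr), 2))
```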
PCA vs ICA Continued
- Okay, so let me give you just a few more examples, Michael, of how PCA and ICA differ. And I'm going to do this mainly by talking about certain things that ICA does. This is really just for your edification, but I think it really helps to think about those underlying models and how they differ, by thinking about how they react differently to different kinds of data. So here are a couple of examples. The one we covered right away was the blind source separation problem. And of course, what we recall is that ICA was, in some sense, designed to solve the blind source separation problem, and in fact ICA does an excellent job at solving it. Meanwhile, PCA does a terrible job at solving the blind source separation problem, and that's just because it's assuming this kind of Gaussian distribution; it just does not recover what's going on. But here's something that ICA does differently from PCA: because it's got these blind-source roots in it, it's actually directional. And what I mean by that is, you recall I drew this sort of matrix before, where we had features this way, and we had samples of those features, like time samples of sound, this way. It turns out that for PCA, it doesn't matter whether I give you this matrix or I give you the transpose of this matrix. It ends up finding the same answer.
- And that should make sense, right? Because it's just basically finding a new rotation of the data, and these are effectively just rotations of each other, if you think about it geometrically in space. ICA, on the other hand, gives you completely different answers if you give it one versus the other. So ICA is highly directional, and PCA much less so. But ICA, because it's making this kind of fundamental assumption, does end up generating some really cool results. I'm going to give you another example, one that's like blind source separation. In fact, I'm going to give you two. Okay? So, imagine you had a bunch of inputs of faces. Okay? So, here are my input faces. I give you bunches and bunches and bunches of faces. What do you think PCA would do? What do you think the first principal component of PCA would be over pictures of thousands and thousands and thousands of faces?
- Overall darkness of the image.
- Actually, that is exactly right. The first thing that PCA tends to do with images, and we are talking actual pictures, not just sketches here, is it finds the direction of maximum variance, and that tends to be brightness, or luminance. Which kind of makes sense, because that's typically the thing that varies the most across pictures. So in fact, the first thing people often do when they're trying to use PCA on faces is they normalize all of that away, because the first principal component isn't terribly helpful. It's just kind of giving you the average light. What do you think the second thing it would find would be?
- Hair versus not hair?
- No, interestingly enough. What it ends up finding is sort of the average face. Which is kind of like the question you asked me earlier about what happens if you go through the origin.
- So, there's actually a name for this. It's called eigenfaces, because, as I noted before, it's an eigenproblem. And it basically finds kind of the average face. And that's kind of weird, but it works for reconstruction. I don't know what the average face here would look like. Probably something like this. Yeah, so you can see how useful that would be. It's not even clear that's a face. But it works for reconstruction. Do you know what ICA ends up finding?
- Yes, it finds noses. It finds eye selectors. It finds mouth selectors. It finds hair selectors. And I think, intuitively, the way I would think about this is that because PCA is doing this global orthogonality thing, it's going to be forced to find global features. I didn't say ICA was global, and that's because it's basically local; it doesn't care about orthogonality. So it tends to find parts of faces. If you feed it the matrix the right way, it tends to find parts of faces. And so it ends up finding these little selectors. So this has a really nice outcome in cases
like natural images or natural scenes. So what do you think
happens Michael, if I take natural scenes, you know, pictures of the
forest and grass and walking around. Just things that I would see
if I were just walking out in the world and taking bunches
of pictures. And I feed it into ICA. Well the answer
for PCA, by the way, is the same as before, is you
get brightness, you get sort of the average image, things that really
make sense if your goal is to do reconstruction. But ICA actually
gives you something different. What do you think the independent
components, the underlying causes, so to speak, of natural scenes are?
- I'm still thinking about the face parts. But, by analogy, it seems like it
would be the, you know, things in the world like, trees, and rocks, and ground.
- Not quite, and that's, I think, in part because there are too many of those things that kind of overlap in too many different ways. It actually finds something more basic.
- Edges?
- That's exactly right. ICA finds edges. Now
for me, that is incredibly satisfying. It says that
the independent components of the world are edges. Now
there are two things that come out of this.
One is just the satisfying feeling you get by realizing that there's an algorithm that, on its own, recovers something fundamental like edges. That's very nice. But the second thing that's nice about it is, once I use ICA to do my feature transformation and discover that what it's learning are edge detectors, well, then I can just write edge detectors. I can just write algorithms that are very fast, very efficient, that can go through images of natural scenes and pull out the edges. It's unclear how to do that with, you know, different principled projections, but if edges are the fundamental building blocks of natural scenes, then there are people who know how to write edge detectors very quickly. By the way, you get a similar result, I won't talk about it, but you get a similar result in our information retrieval problem, where you have documents. And you can read this in the material we gave you. But what ICA ends up giving you for documents are topics, and they're very easily interpreted that way: collections of words that select for specific topics, and collections of uncorrelated words that get rid of distractor documents. So the independent components of scenes are edges, and the independent components of documents are topics. And that feels very nice. But in both of these cases, both with edges and with topics, it turns out that the form of these features is something that you can compute very, very quickly, and sort of independently of the underlying ICA algorithm.
So why does that matter, Michael? Well, remember what I
said originally about unsupervised learning and what it was good
for, was understanding your data. Doing kind of data analysis
from a human being's point of view. So what something
like ICA, and other algorithms we might imagine, do, is
they allow you to do analysis over your data, to
discover fundamental features of them, like edges, or topics or
noses and eyes. And then once you understand that's what's making
up your data, you can simply write code that will select for it, or figure
out what the important parts of your
data are, correspondingly. And so here, we haven't just done this feature transformation problem for the sake of doing classification; feature transformation actually helps us to understand the fundamental underlying causes, or at least the structure, of our data.
- Okay, Michael. So we've talked about PCA and ICA, and they both work remarkably well in the specific domains that they're designed for. And they've been applied for decades on a wide variety of problems for doing this sort of feature transformation. But I'm going to just very briefly describe two other alternatives, and sort of give you a notion of the space, okay?
- Okay, the first one is kind of irritating, but I feel obligated to share it with you. And it's called, well, it's got many different names, but I'm going to call it RCA, just because I like the symmetry. RCA stands for random components analysis. So what do you think random components analysis does? This is also called random projection.
- I'm going to guess, instead of finding dimensions with,
say, high variance, it's just going to pick any direction.
- That's exactly right. What RCA does is it generates random directions.
- Then I guess it projects the data out into those directions.
- That's exactly right. It's like saying it picks a random P, a random matrix, to project your data onto. This matrix is, in some sense, just a set of random linear combinations. And you want to know something? It works. It works remarkably well.
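(A minimal sketch of the idea, assuming only NumPy is available; scikit-learn's GaussianRandomProjection wraps essentially the same trick: draw a random matrix P and project onto it.)

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1000, 50          # original and projected dimensionality
X = rng.normal(size=(5000, n))

# P is just a random matrix; each new feature is a random linear
# combination of all the original features.
P = rng.normal(size=(n, m)) / np.sqrt(m)
X_projected = X @ P      # shape (5000, m)
print(X_projected.shape)
```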
- At what though? In terms
of like reconstruction?
- At reconstruction? Well, not particularly well at reconstruction. But you know what it works really well for: it works really well if the next thing you're going to do is some kind of classification.
- Now why is it you think that it actually works? Can you imagine why
just picking a bunch of random directions
and projecting onto those random directions might work?
- Well, it does mix things together differently. I don't know why this would work better than the original data, unless the original data is somehow purposely made to not work.
- Well, remember what we're doing here, right? We're starting with N dimensions, and we're projecting down to M dimensions, where M is significantly lower than N. So, I started with a bunch of dimensions. Now remember, the real problem here is not that I can't gather the data from the N dimensions. It's that there's a whole bunch of them: the curse of dimensionality.
- So I need to have a lot of data. So if I don't have a lot of data, at least certainly not an exponential amount of data, so to speak, and I project to a lower dimension, why would random projections still give me something that helps with classification?
- So it's as if it's maintaining some of the information from these other dimensions, even though there are fewer of them now. They're all kind of mixed together, but the signal might still be there.
- That's exactly right. And because you project into a lower dimension, you end up dealing with the curse of dimensionality problem, which is sort of the whole point of this, or one of the main points of this, in the first place. And really, a way I think of summarizing what you're saying is that this manages to still pick up some correlations. So if I take random linear combinations of all of my features, then there's still information from all of my features there.
So in practice, at least in my experience, Michael, it turns
out that M, the number of lower dimensions you project into, for
a randomized components analysis, or randomized projections, tends to be bigger
than the M that you would get by doing something like PCA.
So you don't end up projecting down to sort of the
lowest possible dimensional space. But you still project down to a lower
dimensional space that happens to capture your correlations, or at
least capture some of the correlations, which often ends up
working very well for a learner or a classifier down
the road. You can actually see how, in this case, you might even project into another set of dimensions M, where that number of dimensions is actually bigger than the number of dimensions you started out with. This, in some sense, is almost exactly what we did with perceptrons in solving XOR. Basically, you're projecting into higher dimensional spaces by doing this. Does that all make sense?
- Yeah, I think so. I mean it makes sense
to me that it would be not as efficient in
some sense, as PCA, because it sort of reminds me
of, you know, if I want to, if I want
to paint my wall, I can very carefully paint all
the little pieces of it. Or, I could just splatter
stuff at it. It generally takes more when you splatter
because you're not being as systematic, but it does, ultimately,
cover your wall.
- [LAUGH] Yeah, I think that's an interesting analogy, and I'm going to go with it as an apt one. Okay, so, this sort of thing works. What advantages does it actually have over PCA and ICA? Can you imagine one? There's one in particular which I think sort of jumps out at you.
- RCA? Well, I don't know. Is that, is that a good quiz question maybe?
- Sure, let's make it a quick quiz.
Okay, so here's the quiz question then. What is
the big advantage of RCA and there is a
single word to describe it. So your job Michael
is to look into my brain and imagine what word
it is I'm thinking of and I gave you a hint by saying it jumps out at you,
which means if you don't come up with it
you're going to feel really bad. [LAUGH] Okay, so you ready?
- Okay Michael, what did you come up with?
- I have a bunch. Well, so, the big advantage of ICR, RCA, it's cheap.
- It is cheap.
- It's simple.
- It is simple.
- It's easy.
- It is easy, in a sort of computational complexity sense.
- Sort of practically free. Those are all
words like, free, easy. Those are, those are
the words that came to mind. You're saying
that that's, none of those is your word.
- No, none of those was the word that I was coming up with.
- How about, works?
- It did start with the letter f, though.
- Foul mouthed.
- That's two words.
- Alright, I need more letters. I'm
still going back to famous. Fabulous. Fast.
- That's right, it's fast. But I would definitely give you cheap or easy. Cheap is like fast.
- So is easy.
- [LAUGH] Alright.
- [LAUGH] At least, that's what the slang terms meant where I grew up. Anyway, okay, so anything that captures fast, cheap, easy, whatever, would do here. But what's irritating about this, and actually about things like k-means and k-NN and a whole bunch of other algorithms, is that they're so freaking
kind of annoying sometimes because, you know, as machine learning people,
we want to come up with these complex things that work,
but as a very brilliant man once said to me,
you must earn your complexity. So, RCA, randomized components analysis,
or really randomized projections, has this very simple way of
thinking about the world. I'll just randomly throw things together,
and it turns out to pick up correlations along the way and work pretty well. And its big advantage, other than working sometimes, is that it's very, very fast. PCA can take hours on supercomputers, and ICA can take hours upon hours on even more supercomputers. But generating a bunch of random numbers? Computers are really good at that, and it happens to go really, really fast.
- So, the second alternative I just wanted
to mention very briefly is something called LDA, which
is short for linear discriminant analysis. So, any
guess on what linear discriminant analysis might do, Michael?
- It'll make, well, linear. So it's got a
linear in it. Discriminant reminds me of, like, classification lines.
- Yes. So, what linear discriminant analysis
does is, it finds a projection that
discriminates based on the label. This actually will feel, this
feels a lot like supervised learning, right? You take advantage
of the label, and you come up with a transformation,
such that you will, in fact, project things into different
clumps or clusters, or otherwise separate them based upon their
label. So, you can imagine using a lot of algorithms,
as we've used in the past. In some sense, what
they're doing is, linear discriminant analysis. They're finding lines or
finding linear separators between different clumps of points. And
if you think about the binary case, for example,
you could think of SVM as doing something like
this, where what you've done is, you transform your points
not into a specific label plus or minus, one
or zero, yes or no, but you instead project it
onto the line such that it would clump things
accordingly. And you use the value of that projection as
a way of re-representing the data. Does that make sense?
- Yeah, and this, this approach seems a
little different from all the other ones, in
that it's actually paying attention to how the
resulting components are going to get used. Right.
- That's exactly right.
- It actually is going to try
to help you, specifically, with the classification.
- Alright, that's exactly right. All three
of the examples that we gave before,
feel almost like filter methods, right? In
that they have some criterion they're trying
to maximize. Even though randomized projections is, who knows
what it's trying to maximize besides randomness. But they
don't care about the ultimate learner, or the ultimate
label that's associated with them. Whereas LDA does care explicitly
about the labels and wants to find ways to
discriminate, to find sort of projections or features that make it easier for you to do that discrimination. Now, it also does not care about what learner is going to happen next, but it does care about the label. So it does end up doing something slightly different, and works pretty well in a world where you have very simple things that can be linearly discriminated. Okay?
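(A minimal sketch, assuming scikit-learn's LinearDiscriminantAnalysis: fit the projection using the labels, then hand the projected data to whatever learner comes next. The dataset and downstream classifier here are just placeholders.)

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# LDA uses the labels to find directions that separate the classes.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_new = lda.transform(X)          # 4 original features -> 2 discriminative ones

# Re-represent the data with the projection, then train whatever comes next.
clf = LogisticRegression(max_iter=1000).fit(X_new, y)
print("training accuracy:", clf.score(X_new, y))
```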
- So that's it for the alternatives. I actually think that's pretty much it for feature transformation. So why don't we wrap up? [CROSSTALK]
- Just a quick, just a quick question, though. [CROSSTALK] So, LDA, I've
heard of LDA in this context before
meaning something else. Like, latent Dirichlet allocation?
- Yeah, but that happened long after linear discriminant analysis, and we haven't moved past the 90s.
- Well, I'm just saying that it does seem.
And I think it is actually an unsupervised approach.
- Oh latent, oh no, absolutely. But it is well
beyond the scope of the discussion that we're going to have here.
- All right. As you wish.
- But it is in fact. Every time you see a D you can think Dirichlet.
- Which is the computer science
pronunciation. Other people pronounce it differently.
- Yes, because the correct way of pronouncing it is sort of beyond me. It has another syllable in there. Is it German or something?
- Who is Dirichlet? Is it Dirichlet? [LAUGH] That sounded German. [LAUGH] Okay, so let's wrap up, Michael.
- Okay, Michael, so we've gone on a journey of discovery [LAUGH] through unsupervised learning, and so my question to you is, what have we learned?
- I think we learned a little bit about ourselves.
- And a little bit about America. So what A's have we learned today, Michael?
- So there was PCA.
- Mm-hmm.
- RCA and USA.
- USA, we're number one! Whoo!
- Okay, we're just going to erase that little bit. [LAUGH]
- Okay, yeah, okay, so we learned about a lot of A's today.
- Which is the same grade that all our students
are going to get, I am sure. That would be great.
- That's if they're truly independent. If they aren't independent, then the central limit theorem says there will be a normal distribution across grades.
- Ring that bell curve.
- Aw, yeah, baby. Okay. So we learned about PCA,
ICA, LDA and RCA. What else did we learn about?
- Well, I think that was it. But we talked, specifically, we talked in detail about the relationships between some of these.
- In particular, these are all examples of feature transformation algorithms.
- That's right. Okay, so we found out about the relationships between these different transformation analyses. Oh, here's something we learned. We learned that the A in the algorithms doesn't just stand for analysis; it actually does stand for the analysis of the data.
- Because that's unsupervised learning.
- That's right. And in particular, I gave some examples where ICA tells you what the underlying structure of the data is. You can use it to find structure. So, for example, the independent components of natural scenes are edges.
- So it's interesting, because I feel like the other time that you emphasized structure was when you were talking about MIMIC, which was a piece of work that you did when you were a graduate student.
- One would almost want to guess that maybe
you worked on ICA when you were a graduate student.
- I actually did. My very first paper as a young graduate student was on MIMIC, and my very last paper as a young graduate student, my actual dissertation, was on independent components analysis.
- I had that sense from the number of strong
points you felt the need to make [LAUGH] about ICA.
- Well listen man, really, structure runs my life.
As you know, everything about my life is well-structured.
- Yeah, sure.
- [LAUGH] Okay, did we learn anything else?
- So, yeah, I mean, I feel like we spent a lot of time talking about how ICA is a more probabilistic kind of modeling method, and PCA is a more, I want to say, linear algebraic modeling method.
- That's a really good point, Michael. So, we didn't say it explicitly this way, but actually, even in our own work it often comes up that sometimes you want to think about information theory, you want to think about probability, and sometimes you really just want to think about linear algebra. And you could see PCA as being really about linear algebra, and sometimes only coincidentally being about probability. Whereas ICA is all about probability and information theory, and only coincidentally ever about linear algebra.
- Yeah, that's helpful. That does seem to be a fundamental
split in a lot of work that happens in machine learning.
- Yeah, and it makes some sense. I mean, we know what the right answer is in probability, and we know what the right answer is in linear algebra. And I guess it's often the case, Michael, would you agree, that the linear algebra approach is often easier to think about, or easier to do in practice, and sometimes it can be interpreted as if it's probability. That typically breaks down on the edge cases, but you can kind of work around it for the common cases.
- Yeah, that the linear algebra algorithms are often cheaper to implement, cheaper to execute, less prone to local minima issues. There's sort of a well-defined answer that they're finding. But it's often not quite the answer that you want, and the probability methods give you the answer that you want, but it can be very hard to find.
- Right. And in fact you can see that in ICA and PCA, in that PCA is very well understood. There are lots of fast algorithms for it, and you know that the principal components always exist. Interestingly, we didn't talk about this, but by contrast, ICA, with its more information-theoretic and probabilistic roots, has a very specific model. And it isn't always the case that that model fits, and so in fact, sometimes you can't find independent components, [INAUDIBLE] because they don't actually exist, except in the most trivial sense. So it's both more expensive, because of the way you end up searching the space, and it doesn't always produce an answer. But when it does produce an answer, it tends to produce very satisfying ones.
- Well, I think that's a good place to stop, in the sense that I wanted to know just one more interesting fact about ICA. And now that I've got that, I feel fully satisfied.
- Well, there's another fact I can tell you, which is that it's the only one of these algorithms that starts with a vowel.
- And now I'm more than satisfied.
- It is always my goal to leave you more than satisfied.
- [LAUGH]. Alright, then!
- OK. Well, I think we are done with this entire sub-lesson mini-course thingy.
- About unsupervised learning.
- Well, that's great!
- What does that, what does that leave us to do? It doesn't leave us to do anything, at least not with this particular mini-course.
I think actually the description we had here, of what we've learned for this particular lesson, actually applies even going backwards to some of our other lessons. And what we're going to get to do next is decision problems and reinforcement learning. Ooh, exciting.
- It is exciting.
- But I think first people probably have some
homeworky stuff to do, projects to do in the context of this minicourse.
- Yes, and probably an exam of some sort.
- Good luck to everyone on that.
- Yes, we're absolutely sure you'll do fine. Be sure to go over these lessons and be sure to read all of the material, because there's a lot of detail in the material that wouldn't make a lot of sense for us to cover in this format. But do give it a read, come back, look at the stuff that we've talked about, and it should help you understand the intuition behind what's really happening there.
- Excellent. Well, this is great. Thanks, Charles, I learned a lot.
- I did too. Bye, Michael, I will hear from you soon. Alright, awesome.