[Narrator] So welcome to the class on unsupervised learning. We talked a lot about supervised learning, in which we are given data and target labels. In unsupervised learning we're just given data. So here's a data matrix of data items of N features each; there are M in total. The task of unsupervised learning is to find structure in data of this type. To illustrate why this is an interesting problem, let me start with a quiz. Suppose we have 2 feature values, one over here and one over here, and our data looks as follows. Even though we haven't talked about unsupervised learning yet, I'd like to quiz your intuition on the following 2 questions. First, is there structure in this data? Or, put differently, do you think there's something to be learned about data like this, or is it entirely random? And second, to narrow this down, it feels to me that there are clusters in this data. So how many clusters can you see? I'll give you a couple of choices: 1, 2, 3, 4, or none.
[Narrator] The answer to the first question is yes, there is structure. These data clearly don't seem to be completely randomly determined. To me, there seem to be 2 clusters, so the correct answer for the second question is 2. There's a cluster over here, and there's a cluster over here. So one of the tasks of unsupervised learning will be to recover the number of clusters, and the centers of these clusters, and the variances of these clusters, in data of the type I've just shown you.
[Narrator] Let me ask you a second quiz. Again, we haven't talked about any details, but I would like to get your intuition on the following question. Suppose, in a two-dimensional space, all the data lies as follows. This may be reminiscent of the question I asked you about housing prices and square footage. Suppose we have 2 axes, X1 and X2. I'm going to ask you 2 questions here. One is: what is the dimensionality of the space in which this data falls? And the second is an intuitive question: how many dimensions do you need to represent this data to capture the essence? Again, this is not a clear, crisp 0-or-1 type question, but give me your best guess. How many dimensions are intuitively needed?
[Narrator] So, to start with some lingo about unsupervised learning: if you look at this as a probabilist, you're given data, and we typically assume the data is IID, which means independently drawn from the same, identical distribution. A good chunk of unsupervised learning seeks to recover the underlying probability distribution (the density) that generated the data. This is called density estimation. As we will find out, our methods for clustering are versions of density estimation using what are called mixture models. Dimensionality reduction is also a method for doing density estimation, and there are many others. Unsupervised learning can be applied to find structure in data. One of the fascinating examples is called blind signal separation. Suppose you are given a microphone, two people talk simultaneously, and you record the joint signal of both speakers. Blind source separation, or blind signal separation, addresses the question: can you recover those two speakers and filter the data into two separate streams, one for each speaker? Now, this is a really complicated unsupervised learning task, but it is one of the many problems that don't require target signals, yet make for really interesting learning problems. This can be construed as an example of what's called factor analysis, where each speaker is a factor in the joint signal that your microphone records. There are many other examples of unsupervised learning. I will show you a few in a second.
Here is one of my favorite examples of unsupervised learning, one that is yet unsolved. At Google, I had the opportunity to participate in the building of Street View, which is a huge photographic database of many, many streets in the world. As you dive into Street View, you can get ground imagery of almost any location in the world, like this house here, which I chose at random. In these images, there are vast regularities. You can go somewhere else, and you'll find that the type of objects visible in Street View is not entirely random. For example, there are many images of homes, many images of cars, trees, pavement, lane markers, stop signs, just to name a few. So one of the fascinating, unsolved, unsupervised learning tasks is: can you take hundreds of billions of images, such as those in the Street View data set, and discover from them that there are concepts such as trees, lane markers, stop signs, cars, and pedestrians? It seems tedious to hand-label each image for the occurrence of such objects, and attempts to do so have resulted in very small image data sets. Humans can learn from data even without explicit target labels. We often just observe, and in observing, we apply unsupervised learning techniques. So one of the great, great open questions of artificial intelligence is: can you observe many intersections and many streets and many roads, and learn from them what concepts are contained in the imagery? Of course, I can't teach you anything as complex in this class; I don't even know the answer myself. So let me start with something simple: clustering. Clustering is the most basic form of unsupervised learning. I will tell you about two algorithms that are very related. One is called k-means, and one is called expectation maximization. K-means is a nice, intuitive algorithm to derive clusterings. Expectation maximization is a probabilistic generalization of k-means. Both can be derived from first principles.
Let me explain k-means by an example. Suppose we're given the following data points in a 2-dimensional space. K-means estimates, for a fixed number k (here k = 2), the best centers of clusters representing those data points. Those are found iteratively by the following algorithm. Step 1: Guess cluster centers at random, as shown over here with the 2 stars. Step 2: Assign to each cluster center, even though they are randomly chosen, the most likely corresponding data points. This is done by minimizing Euclidean distance. In particular, each cluster center represents half of the space, and the line that separates the space between the left and right cluster centers is the equidistant line, part of what's called a Voronoi diagram. All the data points on the left correspond to the red cluster, and the ones on the right to the green cluster. Step 3: Given that we now have a correspondence between the data points and the cluster centers, find the optimal cluster center for the points associated with it. Our red cluster center has only 2 data points attached, so the optimal cluster center would be the halfway point in the middle. Our right cluster center has more than 2 points attached; yet it isn't placed optimally, as you can see as it moves with the animation back and forth. By minimizing the joint quadratic distance to all of those points, the new cluster center is attained at the center of those data points. The final step is to iterate: go back and reassign the data points to cluster centers. Now the Voronoi diagram has shifted, the points are associated differently, and then we re-evaluate what the optimal cluster center looks like given the associated points. In both cases we see significant motion. Repeat. Now this is the final clustering: the point association doesn't change, and as a result, we have converged.
You just learned about an exciting clustering algorithm that's really easy to implement, called k-means. To give you the algorithm in pseudocode: initially, we select k cluster centers at random, and then we repeat two steps. In the correspondence step, we assign each data point to the nearest cluster center; then we calculate the new cluster centers as the means of the corresponding data points. We repeat this until nothing changes any more. Now, special care has to be taken if a cluster center becomes empty, which means no data point is associated with it. In that case, we just restart the empty cluster center at a random location. This algorithm is known to converge to a locally optimal clustering of data points. The general clustering problem is known to be NP-hard, so a locally optimal solution, in a way, is the best we can hope for. Now let me talk about problems with k-means. First, we need to know k, the number of cluster centers. Second, as I mentioned, there are local minima. For example, for 4 data points like this and 2 cluster centers that happen to be just over here, with the separation line like this, there would be no motion in k-means, even though moving one center over here and one over there would give a better solution. There's also the general problem of high dimensionality of the space, not dissimilar from the way k-nearest neighbor suffers from high dimensionality. And then there's the lack of a mathematical basis. Now, if you're a practitioner, you might not care about a mathematical basis, but for the sake of this class, let's just care about it. So here's a first quiz for k-means: given the following two cluster centers, C1 and C2, click on exactly those points that are associated with C1 and not with C2.
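The pseudocode just described can be sketched in a few lines of Python. This is a minimal illustration under my own naming (`kmeans` and its parameters are not from the lecture), not a production implementation; it handles 2-D points and restarts empty clusters at random, as discussed.

```python
import random

def kmeans(points, k, iters=100):
    """Minimal k-means sketch: points is a list of (x, y) tuples."""
    centers = random.sample(points, k)  # select k cluster centers at random
    for _ in range(iters):
        # Correspondence step: assign each point to the nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        new_centers = []
        for cluster in clusters:
            if cluster:  # the mean of the assigned points becomes the new center
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:        # empty cluster center: restart it at a random data point
                new_centers.append(random.choice(points))
        if new_centers == centers:  # converged: nothing changes any more
            break
        centers = new_centers
    return centers
```

Since the result depends on the random initialization, in practice one runs this several times and keeps the best clustering, which is exactly the local-minimum problem mentioned above.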
And the answer is these 4 points over here. And the reason is, if you draw the line of equal distance between C1 and C2, the separation of these 2 cluster areas falls over here. C2 is down there. C1 is up here.
So here's my second quiz. Given the association that we just derived for C1, where do you think the new cluster center, C1, will be found after a single step of estimating its best location given the associated points? I'll give you a couple of choices. Please click on the one that you find most plausible.
And the answer is, over here. These 4 data points are associated with C1, so we can safely ignore all the other ones. This one over here would be at the center of the 3 data points over here, but this data point over here pulls the center drastically towards it. This is about the best trade-off between these 3 points over here, which all have a spring attached and pull in this direction, and this point over here. Any of the other choices don't even lie between those points, and therefore won't be good cluster centers. The one over here is way too far to the right.
In our next quiz, let's assume we've done one iteration, and the cluster center C1 moved over there and C2 moved over here. Can you once again click on all those data points that correspond to C1?
And the answer is now simple. It's this one over here, this one, this one, and this one. The reason is, the line separating both clusters runs around here. That means all the area over here is C2 territory, and the area over here is C1 territory. Obviously, as we now iterate k-means, C1, which has moved straight over here, will be able to stay, whereas C2 will end up somewhere over here.
So, let's now generalize k-means into expectation maximization. Expectation maximization is an algorithm that uses actual probability distributions to describe what we're doing. It's in many ways more general, and it's also nice in that it really has a probabilistic basis. To get there, I have to take a short detour and tell you about Gaussians, or the normal distribution. The reason is that so far we've only encountered discrete distributions, and Gaussians will be our first example of a continuous distribution. Many of you know that a Gaussian is described by a density function that looks as follows, where the mean is called mu and the variance is called sigma squared. For any X along the horizontal axis, the density is given by a normalizer times the exponential of minus ½ of (x - mu) squared over sigma squared. This function might look complex, but it's also very, very beautiful. It peaks at X = mu, where the value in the exponent becomes 0, and towards plus or minus infinity it goes to 0 quickly; in fact, exponentially fast. The argument inside is a quadratic function, and the exponential function makes the decay exponential. This term over here is a normalizer that makes sure the area underneath integrates to one, which is characteristic of any probability density function. If you map this back to our discrete random variables: for each possible X, we can now assign a density value, which is effectively how probable it is that this X might be drawn. Now, the space itself is continuous, so any individual value will have a probability of 0, but what you can do is pick an interval from A to B, and the area underneath this function between A and B is the total probability that an experiment will come up with a value between A and B. Clearly, it's more likely to generate values around mu than it is to generate values in the periphery, say, over here.
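The density and the interval probability just described can be sketched in Python. The function names here are my own; the interval probability uses the Gaussian CDF, which the standard library exposes via the error function `math.erf`.

```python
import math

def normal_pdf(x, mu, sigma2):
    """1-D Gaussian density: 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def prob_interval(a, b, mu, sigma2):
    """P(a <= X <= b): the area under the density between a and b, via the CDF."""
    sigma = math.sqrt(sigma2)
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(b) - cdf(a)
```

For example, `prob_interval(-1, 1, 0, 1)` gives the familiar result that roughly 68% of the probability mass of a standard Gaussian lies within one sigma of the mean.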
And just for completeness, I'm going to give you the formula for what's called the multivariate Gaussian, where multivariate means nothing else but that we have more than one input variable. You might have a Gaussian over a 2-dimensional or 3-dimensional space. Often, these Gaussians are drawn by what are called level sets: sets of equal probability density. Here's one in a 2-dimensional space, X1 and X2. The Gaussian itself can be thought of as coming out of the paper towards me, where the most likely or highest point of probability is the center over here, and these rings mark areas of equal probability density. The formula for a multivariate Gaussian looks as follows: N is the number of dimensions in the input space. Sigma is a covariance matrix that generalizes the value over here. And the inner product inside the exponential is now done using linear algebra, where this is the difference between a probe point X and the mean vector mu, transposed, times sigma to the minus 1, times X - mu. You can find this formula in any textbook or web page on Gaussians or multivariate normal distributions. It looks cryptic at first, but the key thing to remember is that it's just a generalization of the 1-dimensional case. We have a quadratic over here, as manifested by the product of this guy and this guy. We have a normalization by a variance or covariance, as shown by this number over here or the inverse matrix over here. And then this entire thing is in exponential form in both cases; the normalizer looks a little different in the multivariate case, but all it does is make sure that the volume underneath adds up to 1, to make it a legitimate probability density function. For most of this explanation, I will stick with 1-dimensional Gaussians, so all you have to worry about is the formula over here, but the multivariate case is given just for completeness.
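For reference, since the transcript points at formulas on the slides, the two densities being discussed can be written out as follows (standard notation, not the lecture's own rendering):

```latex
% 1-dimensional Gaussian density with mean \mu and variance \sigma^2
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)

% Multivariate Gaussian density in N dimensions,
% with mean vector \mu and covariance matrix \Sigma
f(x) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}
       \exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{\top}\,\Sigma^{-1}\,(x-\mu)\Big)
```

In the multivariate formula, the quadratic term, the normalization by the (co)variance, and the exponential form all directly generalize the 1-dimensional case, as described above.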
I will now talk about fitting Gaussians to data, or Gaussian learning. You may be given some data points, and you might wonder: what is the best Gaussian fitting the data? To explain this, let me first tell you what parameters characterize a Gaussian. In the 1-dimensional case, they are mu and sigma squared. Mu is the mean; sigma squared is called the variance. If we look at the formula of a Gaussian, it's a function over any possible input X, and it requires knowledge of mu and sigma squared. And, just restating what I said before, we get this function over here that specifies the density for any value X given a specific mu and sigma squared. Suppose we wish to fit data, and our data is 1-dimensional and looks as follows. Just looking at this diagram makes me believe that there's a high density of data points over here and a fading density of data points over there, so maybe the most likely Gaussian will look a little bit like this, where this is mu and this is sigma. There are really easy formulas for fitting Gaussians to data, and I'll give you the result right now. The optimal or most likely mean is just the average of the data points. There are M data points, X1 to Xm. The average looks like this: the sum of all data points divided by the total number of data points. That's called the average, and once you've calculated the average, sigma squared is obtained by a similar normalization of a slightly more complex sum: we sum the squared deviations from the mean and compute the average squared deviation from the mean, and that gives us sigma squared. So, intuitively speaking, the formulas are really easy. Mu is the mean, or average. Sigma squared is the average quadratic deviation from the mean, as shown over here.
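These two fitting formulas translate directly into code. A minimal sketch (the function name is my own):

```python
def fit_gaussian(data):
    """Maximum likelihood fit of a 1-D Gaussian:
    mu is the average of the data points,
    sigma^2 is the average squared deviation from mu."""
    m = len(data)
    mu = sum(data) / m
    sigma2 = sum((x - mu) ** 2 for x in data) / m
    return mu, sigma2
```

Note that this divides the squared deviations by M, which is the maximum likelihood estimator discussed here, not the M - 1 of the unbiased sample variance.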
Now I want to take a second to convince ourselves that this is indeed the maximum likelihood estimate of the mean and the variance. Suppose our data looks like this: there are M data points. The probability of those data points for any Gaussian model mu, sigma squared is the product of the individual data point likelihoods p(xi). If you plug in our Gaussian formula, you get the following: this is the normalizer multiplied M times, where the square root is now drawn into the exponent M over 2, and here is our joint exponential. We took the product of the individual exponentials and moved it into a single exponent, where it becomes a sum. So the best estimates for mu and sigma squared are those that maximize this entire expression over here for the given data set X1 to Xm. We seek to maximize this over the unknown parameters mu and sigma squared. And now I will apply a trick: instead of maximizing this expression, I will maximize the logarithm of this expression. The logarithm is a monotonic function, so let's maximize the logarithm instead, where this expression over here resolves to this expression over here. The multiplication becomes a minus sign from over here, and this is the argument inside the exponent written slightly differently, pulling the 2 sigma squared to the left. So let's maximize this one instead. The maximum is attained where the first derivative is zero. If we do this for our variable mu, taking the log f expression and computing the derivative with respect to mu, we get the following. The first term does not depend on mu at all, so it falls out, and we are left with this expression over here, which we set to zero. Now we can multiply everything by sigma squared, with the right side still zero, and then bring the Xi to the right and the mu to the left. The sum over all i of mu is M times mu, which equals the sum over i of Xi. Hence, we've proved that the average is indeed the maximum likelihood estimate for the mean of the Gaussian. This is now easily repeated for the variance.
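Written out in formulas (standard notation; the slides carry the same derivation), the steps for the mean just described are:

```latex
\log f \;=\; -\frac{M}{2}\log\!\big(2\pi\sigma^2\big)
        \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{M}(x_i-\mu)^2

\frac{\partial \log f}{\partial \mu}
  \;=\; \frac{1}{\sigma^2}\sum_{i=1}^{M}(x_i-\mu) \;=\; 0
\quad\Longrightarrow\quad
M\mu \;=\; \sum_{i=1}^{M} x_i
\quad\Longrightarrow\quad
\mu \;=\; \frac{1}{M}\sum_{i=1}^{M} x_i
```

The first term of log f does not depend on mu, which is why it falls out of the derivative, leaving exactly the sum shown in the second line.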
If you compute the derivative of this expression over here with respect to the variance, we get minus M over sigma, which happens to be the derivative of this expression over here. Keep in mind that the derivative of a logarithm is one over its internal argument times, by the chain rule, the derivative of the internal argument, which, if you work it out, becomes this expression over here. And this term over here changes sign and becomes the following. Again, you move this term to the left side, multiply by sigma cubed, and divide by M, and we get the following result over here. You might take a moment to verify these steps; I was a little bit fast, but this is relatively straightforward mathematics. If you verify them, you will find that the maximum likelihood estimate for sigma squared is the average squared deviation of the data points from the mean mu. This gives us a very nice basis to fit Gaussians to data points. So, keeping these formulas in mind, here's a quick quiz, in which I ask you to actually calculate the mean and variance for a data sequence. Suppose the data you observe is 3, 4, 5, 6, and 7; there are 5 data points. Compute for me the mean and the variance using the maximum likelihood estimator I just gave you.
So the mean is obviously 5; it's the middle value over here. If I add those numbers together, I get 25, and divide by 5: the average value over here is 5. The more interesting case is sigma squared, and I do this in the following steps. I subtract 5 from each of the data points, which gives me -2, -1, 0, 1, and 2. I square those differences, which gives me 4, 1, 0, 1, 4. And now I compute the mean of those squared differences. To do so, I add them all up, which gives 10, and divide by 5, which gives a sigma squared of 2.
Here is another quiz. Suppose my data looks as follows: 3, 9, 9, 3. Compute for me mu and sigma squared using the maximum likelihood estimator I just gave you.
And the answer is relatively easy. The sum of the data, 24, divided by m = 4 is 6, so the mean value is 6. Subtracting the mean from the data gives us -3, 3, 3, and -3; squaring those gives us 9, 9, 9, 9; and the mean of 4 nines equals 9.
I now have a more challenging quiz for you, in which I give you multivariate data, in this case 2-dimensional data. Suppose my data goes as follows: in the first column I get 3, 4, 5, 6, 7. These are 5 data points, and this is the first feature; the second feature will be 8, 7, 5, 3, 2. The formulas for calculating mu and the covariance matrix sigma generalize the ones we studied before, and they are given over here. So what I would like you to compute is the vector mu, which now has 2 values, one for the first and one for the second column, and the covariance matrix sigma, which now has 4 different values, using the formulas shown over here.
Now, the mean is calculated as before, independently for each of the 2 features. Three, 4, 5, 6, 7: the mean is 5. Eight, 7, 5, 3, 2: the mean is 5 again. Easy calculation. If you subtract the mean from the data, we get the following matrix, and now we just have to plug it in. For the main diagonal elements, you get the same formula as before; you can do this separately for each of the 2 columns. For the off-diagonal elements, you just plug into the formula. So this is the result after plugging it in, and you might want to verify it using a computer.
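Verifying this with a computer, as suggested, takes a few lines of NumPy. Note `bias=True`, which makes `np.cov` divide by m, matching the maximum likelihood estimator used in this lecture rather than the m - 1 sample estimator.

```python
import numpy as np

# The 5 data points from the quiz, one row per feature.
X = np.array([[3, 4, 5, 6, 7],
              [8, 7, 5, 3, 2]], dtype=float)

mu = X.mean(axis=1)           # mean of each feature: [5, 5]
Sigma = np.cov(X, bias=True)  # bias=True divides by m (maximum likelihood estimate)
```

The off-diagonal entries come out negative, reflecting that the second feature decreases as the first increases.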
So this finishes the lecture on Gaussians. You learned what a Gaussian is. We talked about fitting one to data, and we even talked about multivariate Gaussians. But even though I asked you to fit one of those, the one we are going to focus on right now is the 1-dimensional Gaussian. So let's now move back to the expectation maximization algorithm.
It is now really easy to explain expectation maximization as a generalization of k-means. Again, we have a couple of data points here and 2 randomly chosen cluster centers. But in the correspondence step, instead of making a hard correspondence, we make a soft correspondence. Each data point is attracted to a cluster center in proportion to the posterior likelihood, which we will define in a minute. In the adjustment step, or maximization step, the cluster centers are optimized just like before, but now the correspondence is a soft variable, and they correspond to all data points with different strengths, not just the nearest ones. As a result, in EM the cluster centers tend not to move as far as in k-means; their movement is smoothed out. A new correspondence over here gives us different strengths, as indicated by the different coloring of the links, and another maximization step gives us better cluster centers. And as you can see, over time EM will gradually converge to about the same solution as k-means. However, all the correspondences are still alive. That means there is not a 0-1 correspondence; there is a soft correspondence, which relates to a posterior probability that I will explain next.
The model of expectation maximization is that each data point is generated from what's called a mixture: a sum over all possible classes or clusters, of which there are K. We draw a class at random with a prior probability of P of the class C = i, and then we draw the data point X from the distribution corresponding to that class over here. The way to think about this: there are K different cluster centers, shown over here, and each one of those has a Gaussian attached. In the generative version of expectation maximization, we first draw a cluster center, and then we draw from the Gaussian attached to this cluster center. The unknowns here are the prior probabilities for each cluster center, which we call P-i, and the mu-i and, in the general case, sigma-i for each of the individual Gaussians, where i = 1 all the way to K. Expectation maximization iterates 2 steps, just like k-means. One is called the E-step, or expectation step, in which we assume that we know the Gaussian parameters and the P-i. With those known values, calculating the expression over here is a fairly trivial exercise: this is our known formula for a Gaussian, which we just plug in, and this is a fixed probability, summed over all possible classes. So you get for e-ij, the probability that the j-th data point corresponds to cluster center number i: P-i times the normalizer times the Gaussian expression, where we have a quadratic of Xj minus mu-i, times sigma-i to the -1, times the same difference again over here. These are the probabilities that the j-th data point corresponds to the i-th cluster center, under the assumption that we know the parameters P-i, mu-i, and sigma-i. In the M-step, we now figure out what these parameters should have been. For the prior probability of each cluster center, we just take the sum of the e-ij over all data points, divided by the total number of data points.
The mean mu-i is obtained as the weighted mean of the Xj, normalized by the sum over the e-ij, and finally sigma-i is obtained as a weighted sum of quadratic deviations like this (this is the same expression as before), again normalized by the sum over all e-ij. These are exactly the same calculations as before, when we fit a Gaussian, just weighted by the soft correspondence of each data point to each Gaussian. And this weighting is relatively straightforward to apply in Gaussian fitting. Let's do a very quick quiz for EM. Suppose we're given 3 data points and 2 cluster centers. The question is: does this point over here, called X1, correspond to C1, or C2, or both of them? Please check exactly one of these 3 check boxes.
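The E-step and M-step just described can be sketched for 1-dimensional Gaussians as follows. This is my own minimal illustration (the function name, initialization scheme, and variance floor are assumptions, not from the lecture), fitting the priors, means, and variances by iterating the two steps:

```python
import math

def em_1d(data, k, iters=50):
    """EM sketch for a mixture of k 1-D Gaussians."""
    m = len(data)
    lo, hi = min(data), max(data)
    # Crude initialization: spread the means over the data range,
    # unit variances, uniform priors.
    mus = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    sigma2s = [1.0] * k
    ps = [1.0 / k] * k
    for _ in range(iters):
        # E-step: e[i][j] = P(cluster i | x_j), the soft correspondence.
        e = [[ps[i] * math.exp(-0.5 * (x - mus[i]) ** 2 / sigma2s[i])
              / math.sqrt(2 * math.pi * sigma2s[i]) for x in data]
             for i in range(k)]
        for j in range(m):
            z = sum(e[i][j] for i in range(k))
            for i in range(k):
                e[i][j] /= z
        # M-step: refit priors, means, and variances, weighted by the e_ij.
        for i in range(k):
            w = sum(e[i])
            ps[i] = w / m
            mus[i] = sum(e[i][j] * data[j] for j in range(m)) / w
            sigma2s[i] = sum(e[i][j] * (data[j] - mus[i]) ** 2 for j in range(m)) / w
            sigma2s[i] = max(sigma2s[i], 1e-6)  # guard against a collapsing variance
    return ps, mus, sigma2s
```

Compare the M-step with the maximum likelihood Gaussian fit from earlier: it is the same calculation, only weighted by the soft correspondences.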
[Thrun] And the answer is both of them, and the reason is X1 might be closer to C2 than C1, but the correspondence in EM is soft, which means each data point always corresponds to all cluster centers. It is just that this correspondence over here is much stronger than the correspondence over here.
[Thrun] Here is another EM quiz. For this quiz we will assume a degenerate case of 3 data points and just 1 cluster center. My question pertains to the shape of the Gaussian after fitting, specifically its covariance sigma. The question is: is this Gaussian circular, which would look like this, or elongated, which would look like this or like this?
[Thrun] And the answer is, of course, elongated. As you look over here, what you find is that this is the best Gaussian describing the data points, and this is what EM will calculate.
[Thrun] This is a quiz in which I compare EM versus k-means. Suppose we are given 4 data points, as indicated by those circles, and we have 2 initial cluster centers, shown here in red, which converge to possible places indicated by those 4 squares. Of course, the centers won't occupy all 4 of them; they will just take 2 of them. But for now I'm going to give you 4 choices for cluster 1 and cluster 2, which we call A, B, C, and D. In EM, will C1 move towards A, or will C1 move towards B? And in contrast, in k-means, will C1 move towards A, or will C1 move towards B? This is just asking about the left side of the diagram. So the question is: will k-means find itself in the more extreme situation, or will EM find itself in the more extreme situation?
[Thrun] And the answer is that while k-means will go all the way to the extreme, A, which is this one over here, EM will not. This has to do with the soft versus hard nature of the correspondence. In k-means the correspondence is hard, so after the first step, only these 2 data points over here correspond to cluster center 1, and it will find itself straight in the middle, where A is located. In EM, however, there will still be a soft correspondence to the points further away, which leads to a small shift of the cluster center to the right side, as indicated by B. That means k-means and EM will converge to different models of the data.
[Thrun] One of the remaining open questions pertains to the number of clusters. So far I've assumed it's simply constant and you know it, but in reality you don't know it. Practical implementations often guess the number of clusters along with the parameters. The way this works is that you periodically evaluate which data is poorly covered by the existing mixture, generate new cluster centers at random near unexplained points, and then run the algorithm for a while to see whether the existence of those clusters is still justified. The justification test is based on the minimization of a criterion that combines the negative log likelihood of your data and a penalty for each cluster. In particular, you're going to minimize the negative log likelihood of your data given the model, plus a constant penalty per cluster. If we look at this expression, the first term is the expression that EM already optimizes: we maximize the posterior probability of the data, the logarithm is a monotonic function, and I put a minus sign over here so the maximization problem becomes a minimization problem. The second term over here, a constant cost per cluster, is new. If you increase the number of clusters, you pay a penalty that works against your attempted minimization. Typically, this expression balances out at a certain number of clusters, which is then generically the best explanation for your data. So the algorithm looks as follows: guess an initial K, run EM, remove unnecessary clusters that make the score over here go up, create some new random clusters, and go back and run EM. There are all kinds of variants of this algorithm. One of the nice things is that it also overcomes local minimum problems to some extent. If, for example, 2 clusters end up grabbing the same data, then your test would show that 1 of the clusters can be omitted, thereby improving the score.
That cluster can later be restarted somewhere else, and by randomly restarting clusters, you tend to get a much, much better solution than if you run EM just once with a fixed number of clusters. So this trick is highly recommended for any implementation of expectation maximization.
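The criterion being minimized here can be sketched for 1-dimensional Gaussian mixtures. The function names, the model tuple layout, and the penalty value are my own illustrative assumptions; the score is simply the negative log likelihood plus a constant cost per cluster, as described above.

```python
import math

def mixture_neg_log_likelihood(data, ps, mus, sigma2s):
    """-log p(data | model) for a 1-D Gaussian mixture."""
    nll = 0.0
    for x in data:
        px = sum(p * math.exp(-0.5 * (x - mu) ** 2 / s2) / math.sqrt(2 * math.pi * s2)
                 for p, mu, s2 in zip(ps, mus, sigma2s))
        nll -= math.log(px)
    return nll

def score(data, model, penalty=1.0):
    """Criterion to minimize: negative log likelihood plus a constant cost per cluster."""
    ps, mus, sigma2s = model
    return mixture_neg_log_likelihood(data, ps, mus, sigma2s) + penalty * len(ps)
```

With well-separated data, a 2-cluster model wins despite its extra penalty; with tightly grouped data, the penalty tips the balance towards fewer clusters.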
[Thrun] This finishes my unit on clustering, at least for now. I just want to briefly summarize what we've learned. We talked about k-means, and we talked about expectation maximization. K-means is a very simple, almost binary algorithm that allows you to find cluster centers. EM is a probabilistic generalization that also finds clusters, but additionally modifies the shapes of the clusters by adjusting the covariance matrices. EM is probabilistically sound, and you can prove convergence in log likelihood space. K-means also converges. Both are prone to local minima. In both cases you need to know the number of cluster centers, K, though I showed you a brief trick for estimating K as you go, which also overcomes local minima to some extent.
Let's now talk about a second class of unsupervised learning methods, called dimensionality reduction. We're going to start with a little quiz, in which I will check your intuition. Suppose we're given a 2-dimensional data set, and our data lines up as follows. My quiz is: how many dimensions do we really need? The key is the word "really", which means we're willing to tolerate a certain amount of error in accuracy, as long as we capture the essence of the problem.
The answer is obviously 1. This is the key dimension over here. The orthogonal dimension, in this direction, carries almost no information, so it suffices, in most cases, to project the data onto this 1-dimensional space.
Here is a quiz that is a little bit more tricky. I'm going to draw data for you like this. I'm going to ask the same question. How many dimensions do we really need?
This answer is not at all trivial, and I don't blame you if you get it wrong. The answer is actually 1, but the projection itself is nonlinear. I can draw, really easily, a nice 1-dimensional space that follows these data points. If I am able to project all the data points on this 1-dimensional space, I capture the essence of the data. The trick, of course, is to find the nonlinear 1-dimensional space and describe it. This is what's going on in the state-of-the-art in dimensionality reduction research.
For the remainder of this unit, I am going to talk about linear dimensionality reduction, where the idea is that, given data points like this, we seek to find a linear subspace onto which to project the data. In this case, I would submit this is probably the most suitable linear subspace. So we remap the data onto the space over here, with x1 over here and x2 over here. Then we can capture the data in just 1 dimension. The algorithm is amazingly simple. Number 1: Fit a gaussian; we now know how this works. The gaussian will look something like this. Number 2: Calculate the eigenvalues and eigenvectors of this gaussian. In this gaussian this would be the dominant eigenvector, and this would be the 2nd eigenvector over here. Step 3 is to take those eigenvectors whose eigenvalues are the largest. Step 4 is to project the data onto the subspace of the eigenvectors you chose. Now to understand this, you have to be familiar with eigenvectors and eigenvalues. I'll give you an intuitive familiarity with those. This is standard statistics material, and you will find it in many linear algebra classes. So let me just go through this very quickly and give you an intuition for how to do linear dimensionality reduction. Suppose you're given the following data points. The first coordinates are 0, 1, 2, 3, and 4; the second coordinates are essentially 2, 3, 4, 5, 6, but slightly modified so there is actual variance in that dimension. So I draw this in here. What I get is a set of points that doesn't quite fit a line, but almost. There is a little error over here, a little error over here, and here and here. The mean is easily calculated; it's 2 and 4. The covariance matrix looks as follows. Notice the slightly different variance of the 1st variable, which is exactly 2, versus the 2nd variable, which is 2.008. The dominant eigenvector happens to be (0.7064, 0.7078) with an eigenvalue of 4.004, and the 2nd one is orthogonal with a much smaller eigenvalue.
So obviously this is the eigenvector that dominates the spread of the data points. If you look at this vector over here, it is centered on the mean, which sits over here, and is exactly this vector shown over here, and this one is the orthogonal vector shown over here. So the single dimension with the large eigenvalue explains the data far better than the other dimension, which has a very small eigenvalue. In case these numerical examples look confusing: this is very standard linear algebra. When you estimate covariances from data and try to understand which direction they point, this kind of eigenvalue analysis gives you the right answer.
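The four steps (fit a gaussian, eigendecompose, keep the largest eigenvectors, project) can be sketched in a few lines. The data below is made up to resemble the lecture's example (x values 0 through 4, y values near x + 2), so the exact numbers differ slightly from the slide's.

```python
import numpy as np

# Illustrative 2-D data: y roughly equals x + 2, with small perturbations
# (made-up values resembling, but not identical to, the lecture's example).
X = np.array([
    [0.0, 2.1],
    [1.0, 2.9],
    [2.0, 4.1],
    [3.0, 4.9],
    [4.0, 6.0],
])

mean = X.mean(axis=0)                 # step 1: fit a gaussian (mean ...)
cov = np.cov(X, rowvar=False)         # ... and covariance)
evals, evecs = np.linalg.eigh(cov)    # step 2: eigenvalues and eigenvectors
order = np.argsort(evals)[::-1]       # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

# Step 3: keep the dominant eigenvector; step 4: project onto it.
projected = (X - mean) @ evecs[:, :1]

print(evals)            # the first eigenvalue dominates by orders of magnitude
print(projected.shape)  # (5, 1): the 2-D data reduced to 1 dimension
```

The dominant eigenvector points along the near-line direction of the data, and the tiny second eigenvalue is exactly the "little errors" off the line.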
The dimensionality reduction looks a little bit silly when you go from 2 dimensions to 1 dimension, but in truly high-dimensional spaces it has very strong utility. Here's an example that goes back to MIT several decades ago on something called eigenfaces. These are all well-aligned faces. The objective in eigenface research has been to find simple ways to describe different people in a parameter space in which we can easily identify the same person again. Images like these are very high-dimensional data. If each image is 50 by 50 pixels, each image becomes a data point in a 2500-dimensional feature space. Now obviously, we don't have random images; face images don't fill the entire space of 2500 dimensions. Instead, it is reasonable to assume that all the faces live on a small subspace within that space. Obviously, you as a human can easily distinguish a valid image of a face from a valid image of a non-face, like a car or a cloud or the sky. Therefore, there are many, many images you can represent with 2500 pixels that are not faces. So research on eigenfaces has applied principal component analysis and eigenvalues to the space of faces. Here is a database in which faces are aligned. A researcher, Santiago Serrano, extracted from it the average face after alignment, on the right side. The truly interesting phenomenon occurs when you look at the eigenvectors. The face on the top left, over here, is the average face, and these are the variations, the eigenvectors that correspond to the largest eigenvalues over here. This is the strongest variation. You see a certain amount of different regions in and around the head shape and the hair that get excited. That's the 2nd strongest one, where the shirt gets more excited. As you go down, you find more and more interesting variations that can be used to reconstruct faces.
Typically a dozen or so will suffice to make a face essentially reconstructable, which means you've just mapped a 2500-dimensional feature space into a, perhaps, 12-dimensional feature space in which learning becomes much, much easier.
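The same idea works at toy scale. Below, each "image" is a 25-pixel vector built from 3 hidden patterns, standing in for 2500-pixel faces; the dataset is synthetic and all sizes are illustrative assumptions.

```python
import numpy as np

# Toy eigenfaces sketch: 60 synthetic 25-pixel "images", each a mixture of
# 3 hidden basis patterns plus a little noise (all values are made up).
rng = np.random.default_rng(2)
basis = rng.normal(size=(3, 25))
coeffs = rng.normal(size=(60, 3))
faces = coeffs @ basis + 0.01 * rng.normal(size=(60, 25))

mean_face = faces.mean(axis=0)          # the "average face"
centered = faces - mean_face
# Principal components of the dataset: the "eigenfaces".
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 3                                   # keep only the top components
codes = centered @ Vt[:k].T             # 25-D image -> 3-D code
recon = codes @ Vt[:k] + mean_face      # reconstruct from the short code

err = np.abs(recon - faces).max()
print(err)  # small: 3 numbers per "image" recover it almost exactly
```

The real eigenface pipeline is identical with 2500 pixels and roughly a dozen components instead of 25 pixels and 3.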
In our own research, we have also applied eigenvector decomposition to relatively challenging problems that don't look like a linear problem on the surface. We scanned a good number of people with different physiques: some thin, some not so thin, some tall, some short, some male, some female. We also scanned them in 3-D in different body postures: the arms down, the arms up, walking, throwing a ball, and so on. We applied eigenvector decomposition of the type I've just shown you to understand whether there is a latent low-dimensional space that is sufficient to represent the different physiques that people have, like thin or thick, and the different postures people can assume, like standing and so on. It turns out if you apply eigenvector decomposition to the space of all the deformations of your body, you can find relatively low-dimensional linear spaces in which you can express different physiques and different body postures. For the space of all different physiques, it turns out only 3 dimensions sufficed to explain different heights, different thicknesses or body weights, and also different genders. That is, even though the surfaces themselves are represented by tens of thousands of data points, the underlying dimensionality when scanning people is really small. I'll let you watch the entire movie. Please enjoy. [SCAPE: Shape Completion and Animation of People] We present a method named SCAPE for simultaneously modeling the space of all human shapes and poses. Further, we demonstrate the method's usefulness for both shape completion and animation. The model is computed from an example set of surface meshes. We require only a limited set of training data: examples of pose variation from a single subject and examples of the shape variation between subjects. The resulting model can represent both articulated motion and, importantly, the nonrigid muscle deformations required for natural appearance in a wide variety of poses.
The model can also represent a wide variety of different body shapes, spanning both men and women. Because SCAPE incorporates both shape and pose, we can jointly vary both to create people who never existed and poses that were never observed. We demonstrate the use of this model 1st for shape completion of scanned meshes. Even when a subject has only been partially observed, we can use the model to estimate a complete surface. In this case, the entire front half of the subject has been synthesized. Note that the synthesized data both conforms to the individual subject's specific shape and faithfully represents the nonrigid muscle deformations associated with a specific pose. Mesh completion is possible even when neither the person nor the pose exists in the original training set. None of the women in our example set look similar to the woman in this sequence. Shape completion can also be used to synthesize complete animated surface meshes. Starting from a single scanned mesh of an actor and a time series of motion capture markers, we can treat the markers themselves as a very sparse sampling of the surface geometry and complete the surface which best fits the available data at each point in time. Using this method, animated surface models for a wide variety of motions can be created with relative ease. In addition, the target identity of the surface model can easily be changed simply by replacing the subject portion of our factorized model with a different vector. The new identity need not be present in our training set or even correspond to a real person. An artist is free to alter the identity arbitrarily.
[Thrun] In modern dimensionality reduction, the trick has been to define nonlinear, sometimes piecewise linear, subspaces onto which data is projected. This is not dissimilar from K nearest neighbors, where local regions are defined based on local data neighborhoods. But here we leverage those neighborhoods to make sure that the subspace itself is a feasible subspace. Common methods include locally linear embedding, or LLE, and the Isomap method. If you're interested in these, check the web; there's tons of information on these methods online.
We now talk about spectral clustering. The fundamental idea of spectral clustering is to cluster by affinity. To understand the importance of spectral clustering, let me ask you a simple intuitive quiz. Suppose you are given data like this, and you wish to learn that there are 2 clusters--a cluster over here and a cluster over here. So my question is, from what you understand, do you think that EM or K-means would do a great job finding those clusters, or do you think they will likely fail to find them? So the question is: would EM or K-means succeed in finding the 2 clusters? The choices are likely yes and likely no.
And the answer is likely no, the reason being that these aren't clusters defined by a center of data points; they're clusters defined by affinity, which means they're defined by the presence of nearby points. So take, for example, the area over here, which I'm going to circle, and ask yourself, what's the best cluster center? It's likely somewhere over here where I drew the red dot. This is the cluster center for this cluster, and perhaps this is the cluster center for the other cluster. And these points over here will likely be classified as belonging to the cluster center over here. So EM will likely do a bad job.
So let's look at this example again--let me redraw the data. What makes these clusters so different is not the absolute location of each data point, but the connectedness of these data points. The fact that these 2 points belong together is likely because there are lots of points in between. In other words, it's the affinity that defines those clusters, not the absolute location. So spectral clustering uses a notion of affinity to make clustering happen. Let me look at a simple example for spectral clustering that would also work for K-means or EM, but that will be useful to illustrate spectral clustering. Let's assume there are 9 data points as shown over here, and I've colored them differently in blue, red, and black, but to the clustering algorithm they all come with the same color. Now the key element of spectral clustering is called the affinity matrix, which is a 9 by 9 matrix in this case, where each data point is compared to each other data point. So let me write down all 9 data points into the different rows of this matrix--the red ones, the black ones, and the blue ones--and in the columns, the exact same 9 data points. I then calculate for each pair of data points their affinity, where for now I base the affinity on the quadratic distance in this diagram over here. Clearly, the red dots have a high affinity to each other, which means a small quadratic distance. Let me indicate this as follows. But relative to all the other points, their affinity is weak, so there's a very small value in these elements over here. Similarly, the affinity of the black data points to each other is very high, which means that the following block on the diagonal of this matrix will have very large values, yet the affinity to all the other data points will be low. And of course, the same is true for the blue data points. The interesting thing to notice now is that this is an approximately rank-deficient matrix.
And further, the data points that belong to the same class--like the 3 red dots or the 3 black dots--have a similar affinity vector to all the other data points. So this vector over here is similar to this vector over here. It's similar to this vector over here, but it's very different from this vector over here, which is itself similar to the vector over here, yet different from the previous ones. Such a situation is easily addressed by what's called principal component analysis, or PCA. PCA is a method to identify similar vectors in an approximately rank-deficient matrix. Consider once again our affinity matrix. With principal component analysis, which is a standard linear algebra technique, we can re-represent this matrix by the most dominant eigenvectors you'll find there. The first one might look like this. The second one, which would be orthogonal, may look like this. The third one, like this. These are called eigenvectors, and the principal point now is that each eigenvector has an eigenvalue that states how prevalent this vector is in the original data. For these 3 vectors, you're going to find a large eigenvalue, because there's a number of data points that represent these vectors quite prevalently, like the first 3 do for this one over here. There might be additional eigenvectors, like something like this, but such eigenvectors will have a small eigenvalue, simply because these vectors aren't really required to explain the data over here; they might just be explaining some of the noise in the affinity matrix that I didn't even dare draw in here. Now if you take the eigenvectors with the largest eigenvalues--3 in this case--you first discover the dimensionality of the underlying data space: the dimensionality equals the number of large eigenvalues. Further, if you re-represent each data vector using those eigenvectors, you'll find a 3-dimensional space where the original data falls into a variety of different places.
And these places are easily told apart by conventional clustering. So in summary, spectral clustering builds an affinity matrix of the data points, extracts the eigenvectors with the largest eigenvalues, and then re-maps the data into a new space where the data points are easily clustered the conventional way. This is called affinity-based clustering or spectral clustering. Let me illustrate this once again with a data set for which spectral clustering differs from conventional clustering. In this data set, the different clusters belong together because their affinity is similar. These 2 points belong together because there is a point in between. If we now draw the affinity matrix for those data points, you find that the first and second data points are close together, and the second and the third, but not the first and the third; hence these 2 off-diagonal elements here remain small. Similarly for the red points, as shown here with these 2 elements over here relatively small, and also for the black points, where these 2 elements over here are small. And interestingly enough, even though these aren't exactly block diagonal, your first 3 largest eigenvectors will still look the same as before. I find it quite remarkable that even though these aren't exactly blocks, those vectors still represent the 3 most important vectors with which to recover the data using principal component analysis. So in this case, spectral clustering would easily assign those guys and those guys and those guys to the respective same cluster, which wouldn't be quite as easy for expectation maximization or K-means. So let me ask you the following quiz. Suppose we have 8 data points. How many elements will the affinity matrix have?
And the answer is 64. There's 8 data points--8 times 8 is 64.
My second question is, how many large eigenvalues will PCA find? Now I understand this doesn't have a unique answer, but in the best possible case where spectral clustering works well, how many large eigenvalues do you find?
And the answer is 2. There's a cluster over here and a cluster over here. And while it might happen that you find as many as 8, if you adjust your affinity matrix well, the 2 clusters should correspond to the 2 largest eigenvalues.
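The pipeline described above, affinity matrix, dominant eigenvectors, then conventional clustering in the new space, can be sketched as follows. The 9 points in 3 tight groups are illustrative stand-ins for the colored dots in the lecture, and the Gaussian kernel is one common way to turn quadratic distance into an affinity.

```python
import numpy as np

# Nine 2-D points in three tight groups (illustrative stand-ins for the
# red, black, and blue dots of the lecture).
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group 1
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # group 2
    [0.0, 5.0], [0.1, 5.0], [0.0, 5.1],   # group 3
])

# Affinity: a Gaussian kernel of the squared distance, so nearby points get
# values near 1 and distant points values near 0.
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
A = np.exp(-sq)                               # 9x9, roughly block diagonal

evals, evecs = np.linalg.eigh(A)
evals, evecs = evals[::-1], evecs[:, ::-1]    # largest eigenvalues first
# Three large eigenvalues reveal the underlying dimensionality (3 groups);
# the remaining six eigenvalues are near zero.

embedding = evecs[:, :3]   # re-represent each point by eigenvector weights
# Rows of the same group nearly coincide in this space, so any conventional
# clustering method separates the groups trivially.
```

Note that this follows the lecture's formulation of eigendecomposing the affinity matrix directly; practical spectral clustering implementations usually decompose a normalized graph Laplacian instead, but the affinity-then-eigenvectors structure is the same.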
So, congratulations. You just made it through the unsupervised learning section of this class. I think you've learned a lot. You learned about K-means, about expectation maximization, about dimensionality reduction, and even about spectral clustering. The first 3 items--K-means, EM, and dimensionality reduction--are used very frequently; spectral clustering is a more rarely used method that shows some of the most recent research going on in the field. I hope you have fun applying these methods in practice. I'd like to say a few final words about supervised versus unsupervised learning. In both cases you're given data, but in 1 case you have labeled data, and in the other you have unlabeled data. The supervised learning paradigm is the dominant paradigm in machine learning, and there is a vast number of papers being written about it. We talked about classification and regression and different methods to do supervised learning. The unsupervised paradigm is much less explored, even though I think it's at least equally important--possibly even more important. Many systems--web crawlers, robots, Street View, which I told you about--can collect vast amounts of data. Getting the data is cheap, but getting labels is hard. So to me, unsupervised learning is the method of the future. It's one of the most interesting open research topics to see whether we can make sense out of large amounts of unlabeled or poorly labeled data. In between, there are techniques that do both supervised and unsupervised learning. They are called semi-supervised or self-supervised, and they use elements of unsupervised learning and pair them with supervised learning. Those are fascinating in their own right. Our robot Stanley, for example, which won the DARPA Grand Challenge, used its own sensors to produce labels for other data on the fly. And I'll talk about this when I talk about robotics in more detail.
But for the time being, understand that the supervised and unsupervised paradigms span 2 very large areas of machine learning, and you've learned quite a bit about both.