Welcome to the machine learning lesson. Machine learning is a fascinating area. The world has become immeasurably data-rich. The world wide web has come up over the last decade. The human genome is being sequenced. Vast chemical databases, pharmaceutical databases, and financial databases are now available on a scale unthinkable even 5 years ago. To make sense of the data, to extract information from the data, machine learning is the discipline to go to. Machine learning is an important subfield of artificial intelligence. It's my personal favorite next to robotics, because I believe it has a huge impact on society and is absolutely necessary as we move forward. So in this class I'll teach you some of the very basics of machine learning, and in our next lesson Peter will tell you some more about machine learning. We'll talk about supervised learning, which is one side of machine learning, and Peter will tell you about unsupervised learning, which is a different style. Later in this class we will also encounter reinforcement learning, which is yet another style of machine learning. Anyhow, let's just dive in.
Welcome to the first class on machine learning. So far we talked a lot about Bayes Networks, and the way we talked about them was all about reasoning within Bayes Networks that are known. Machine learning addresses the problem of how to find those networks, or other models, based on data. Learning models from data is a major, major area of artificial intelligence, and it's perhaps the one that has had the most commercial success. In many commercial applications the models themselves are fitted based on data. For example, Google uses data to understand how to respond to each search query. Amazon uses data to understand how to place products on its website. And machine learning techniques are the enabling techniques that make that possible. So this class, which is about supervised learning, will go through some very basic methods for learning models from data, in particular specific types of Bayes Networks. We will complement this with a class on unsupervised learning that will be taught right after this class. Let me start off with a quiz. The quiz is: What companies are famous for machine learning using data? Google, for mining the web. Netflix, for mining what people would like to rent on DVDs, which is DVD recommendations. Amazon.com, for product placement. Check any or all, and if none of those apply, check down here.
And, not surprisingly, the answer is that all of those companies, and many, many more, use massive machine learning for making decisions that are really essential to their businesses. Google mines the web and uses machine learning for translation, as we've seen in the introductory lesson. Netflix has used machine learning extensively for understanding what type of DVD to recommend to you next. Amazon composes its entire product pages using machine learning, by understanding how customers respond to different compositions and placements of their products, and many, many other examples exist. I would argue that in Silicon Valley, at least half the companies dealing with customers and online products use machine learning extensively, which makes machine learning a really exciting discipline.
In my own research, I've extensively used machine learning for robotics. What you see here is a robot my students and I built at Stanford called Stanley, and it won the DARPA Grand Challenge. It's a self-driving car that drives without any human assistance whatsoever, and this vehicle extensively uses machine learning. The robot is equipped with a laser system; I will talk more about lasers in my robotics class, but here you can see how the robot is able to build models of the terrain ahead. These are almost like video game models that allow it to assess where to drive and where not to drive. Essentially, it's trying to drive on flat ground. The problem with these lasers is that they don't see very far. They see about 25 meters out, so to drive really fast the robot has to see further. This is where machine learning comes into play. What you see here are camera images delivered by the robot, superimposed with laser data that doesn't see very far, but the laser is good enough to extract samples of drivable road surface that can then be machine-learned and extrapolated into the entire camera image. That enables the robot to use the camera to see drivable terrain all the way to the horizon, up to 200 meters out, enough to drive really, really fast. This ability to adapt its own vision, by deriving its own training examples using lasers but seeing out 200 meters or more, was a key factor in winning the race.
Machine learning is a very large field with many different methods and many different applications. I will now define some of the very basic terminology that is used to distinguish different machine learning methods. Let's start with the "what." What is being learned? You can learn parameters, like the probabilities of a Bayes Network. You can learn structure, like the arc structure of a Bayes Network. And you might even discover hidden concepts. For example, you might find that certain training examples form a hidden group. In the Netflix case, you might find that there are different types of customers, some that care about classic movies, some that care about modern movies, and those might form hidden concepts whose discovery can really help you make better sense of the data. Next is the "what from." Every machine learning method is driven by some sort of target information that you care about. In supervised learning, which is the subject of today's class, we're given specific target labels; I'll give you examples in just a second. We also talk about unsupervised learning, where target labels are missing and we use replacement principles to find, for example, hidden concepts. Later there will be a class on reinforcement learning, where an agent learns from feedback from the physical environment by interacting, trying actions, and receiving some sort of evaluation from the environment, like "Well done" or "That works." Again, we will talk about those in detail later. There are different things you could try to do with a machine learning technique. You might care about prediction; for example, you might want to know what's going to happen in the future, in the stock market for example. You might care to diagnose something, which is where you get data and you wish to explain it, and you use machine learning for that. Sometimes your objective is to summarize something.
For example, if you read a long article, your machine learning method might aim to produce a short article that summarizes the long one. And there are many, many more. You can also talk about the "how" of learning. We use the word "passive" if the learning agent is just an observer and has no impact on the data itself; otherwise, we call it "active." Sometimes learning occurs online, which means while the data is being generated, and sometimes offline, which means learning occurs after the data has been generated. There are different types of outputs of a machine learning algorithm. Today we'll talk about classification versus regression. In classification the output is binary, or one of a fixed number of classes; for example, something is either a chair or not. Regression is continuous; the temperature might be 66.5 degrees in our prediction. And there are tons of internal details we will talk about. Just to name one: we will distinguish generative from discriminative methods. Generative methods seek to model the data as generally as possible; discriminative methods seek to distinguish data. This might sound like a superficial distinction, but it has enormous ramifications for the learning algorithm. Now, to tell you the truth, it took me many years to fully learn all these words, and I don't expect you to pick them all up in one class, but you should know that they exist. As they come up, I'll emphasize them, so you can sort any learning method I tell you about back into this specific taxonomy.
The vast amount of work in the field falls into the area of supervised learning. In supervised learning you're given, for each training example, a feature vector and a target label called Y. For example, for a credit rating agency, X1, X2, X3 might be features such as: Is the person employed? What is the salary of the person? Has the person previously defaulted on a credit card? And so on. And Y is a prediction of whether the person will default on the credit or not. Now, machine learning is carried out on past data, where the credit rating agency has collected features just like these along with actual occurrences of default or not. What it wishes to produce is a function that allows us to predict future customers. So when a new person comes in with a different feature vector, can we predict, as well as possible, the functional relationship between the features X1 to Xn and Y? You can apply the exact same framework in image recognition, where X might be the pixels of images, or features of things found in images, and Y might be a label that says whether a certain object is contained in an image or not. Now, in supervised learning you're given many such examples: X11 to X1n with label Y1, X21 to X2n with label Y2, and so on, all the way to index m. This is called your data. If we call each input vector Xm, we wish to find a function which, given any Xm, or any future vector X, produces as close as possible my target signal Y. Now, this isn't always possible, and sometimes it's acceptable, in fact preferable, to tolerate a certain amount of error on your training data. But the subject of machine learning is to identify this function over here. And once you've identified it, you can use it for future Xs that weren't part of the training set, to produce a prediction that hopefully is really, really good. So let me ask you a question. This is a question for which I haven't given you the answer, but I'd like to appeal to your intuition.
Here's one data set, where the X is plotted one-dimensionally on the horizontal axis and the Y vertically, and suppose it looks like this. Suppose my machine learning algorithm gives me 2 hypotheses. One is this function over here, which is a linear function, and one is this function over here. I'd like to know which of the functions you find preferable as an explanation for the data. Is it function A? Or function B? Check here for A, here for B, and here for neither.
And I hope you guessed function A. Even though both perfectly describe the data, B is much more complex than A. In fact, outside the data, B goes to minus infinity much faster than these data points over here, and to plus infinity much faster than these data points over here, and in between we have wild oscillations that don't correspond to any data. So I would argue A is preferable. The reason why I asked this question is because of something called Occam's Razor. Occam can be spelled in many different ways, and what Occam says is that, everything else being equal, choose the less complex hypothesis. Now, in practice there's actually a trade-off between a really good data fit and low complexity. Let me illustrate this with a hypothetical example. Consider the following graph, where the horizontal axis shows the complexity of the solution. For example, if you use polynomials, this might be a high-degree polynomial over here, and maybe a linear function, which is a low-degree polynomial, over here. Your training data error tends to go like this: the more complex the hypothesis you allow, the better you can fit your data. However, in reality, your generalization error on unknown data tends to go like this: it is the sum of the training data error and another function, which is called the overfitting error. Not surprisingly, the best complexity is obtained where the generalization error is at its minimum. There are methods to calculate the overfitting error; they belong to a statistical field that goes under the name of bias-variance methods. However, in practice you're often just given the training data error. You'll find that if you don't pick the model that minimizes the training data error, but instead push back the complexity, your algorithm tends to perform better, and that is something we will study a little bit in this class. However, this slide is really important for anybody doing machine learning in practice.
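To make the overfitting trade-off concrete, here is a small, self-contained sketch; the data and the two hypotheses are made up for illustration and are not from the lesson. A "complex" hypothesis that simply memorizes the training set achieves zero training error, while a simple least-squares line fits the training data slightly worse but generalizes far better to new points.

```python
# Hypothetical data: y = 2x plus small alternating noise on the training set.
train = [(x, 2 * x + (0.3 if x % 2 == 0 else -0.3)) for x in range(8)]
test = [(x, 2.0 * x) for x in range(8, 12)]  # unseen points from the same rule

# Simple hypothesis: least-squares line (closed-form slope and intercept).
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx
line = lambda x: slope * x + intercept

# Complex hypothesis: a lookup table that memorizes every training point.
table = dict(train)
lookup = lambda x: table.get(x, 0.0)  # clueless on inputs it has never seen

def mse(predict, data):
    # Mean squared error of a predictor on a data set.
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

print(mse(lookup, train))                   # 0.0: perfect fit of the training data
print(mse(line, test) < mse(lookup, test))  # True: the simple line generalizes
```

The lookup table is the extreme right end of the complexity axis: zero training error, huge generalization error. The line sits near the sweet spot.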
If you deal with data and you have ways to fit your data, be aware that overfitting is a major source of poor performance of a machine learning algorithm. I'll give you an example in just a second.
So a really important example of machine learning is SPAM detection. We all get way too much email, and a good number of those emails are SPAM. Here are 3 examples of email. "Dear Sir: First I must solicit your confidence in this transaction, this is by virtue of its nature being utterly confidential and top secret..." This is likely SPAM. Here's another one, in all caps; this is very likely SPAM. And here's another one: "Oh, I know it's blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use." And so on. Now, this is likely not SPAM. How can a computer program distinguish between SPAM and not SPAM? Let's use this as an example to talk about machine learning for discrimination using Bayes Networks. In SPAM detection, we get an email and we wish to categorize it either as SPAM, in which case we don't even show it to the reader, or as what we call HAM, which is the technical word for an email worth passing on to the person being emailed. So the function over here is the function we're trying to learn. Most SPAM filters use human input. When you go through your email, you have a button called "is spam," which allows you, as a user, to flag SPAM, and occasionally you will flag an email as SPAM. If you look at this, you have a typical supervised machine learning situation, where the input is an email and the output is whether you flag it as SPAM, or, if you don't flag it, we just assume it's HAM. Now, to make this amenable to a machine learning algorithm, we have to talk about how to represent emails. They all use different words and different characters, and they might have different graphics included. Let's pick a representation that's easy to process, and this representation is often called Bag of Words. Bag of Words is a representation of a document that just counts the frequency of words. If an email were to say "hello I will say hello," the Bag of Words representation is the following.
For the dictionary that contains the 4 words "hello," "I," "will," and "say," the counts are 2-1-1-1: 2 occurrences of "hello" and 1 each of the other words. Now look at the subtlety here. Rather than representing each individual word, we have a count of each word, and the count is oblivious to the order in which the words were stated. A Bag of Words representation relative to a fixed dictionary represents the counts of each word relative to the words in the dictionary. If you were to use a different dictionary, like "hello" and "good-bye," our counts would be 2 and 0. However, in most cases you make sure that all the words found in messages are actually included in the dictionary, so the dictionary might be very, very large. Let me make up an unofficial example of a few SPAM and a few HAM messages. The SPAM messages: "Offer is secret." "Click secret link." "Secret sports link." Obviously those are contrived, and I tried to restrict the vocabulary to a small number of words to make this example workable; in practice we need thousands of such messages to get good information. The HAM messages: "Play sports today." "Went play sports." "Secret sports event." "Sports is today." "Sports costs money." My first quiz is: What is the size of the vocabulary that contains all words in these messages? Please enter the value in this box over here.
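The bag of words idea is easy to state in code. Here is a minimal sketch using the dictionary and email from the example above; the variable names are mine, not from the lesson.

```python
from collections import Counter

dictionary = ["hello", "i", "will", "say"]
email = "hello i will say hello"

counts = Counter(email.split())        # word -> frequency; word order is lost
bag = [counts[w] for w in dictionary]  # counts relative to the fixed dictionary

print(bag)                                          # [2, 1, 1, 1]
print([counts[w] for w in ["hello", "good-bye"]])   # [2, 0] for another dictionary
```

Note that `Counter` returns 0 for words never seen, which is exactly the behavior a bag of words over a fixed dictionary needs.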
Well, let's count: offer, is, secret, click. "Secret" occurs over here already, so we don't have to count it twice. Link, sports, play, today, went, event, costs, money. So the answer is 12: there are 12 different words contained in these 8 messages.
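You can check the count mechanically. A short sketch over the 8 example messages:

```python
# The 3 spam and 5 ham messages from the example above.
messages = ["offer is secret", "click secret link", "secret sports link",
            "play sports today", "went play sports", "secret sports event",
            "sports is today", "sports costs money"]

# The vocabulary is the set of distinct words across all messages.
vocabulary = set(word for m in messages for word in m.split())
print(len(vocabulary))  # 12
```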
Another quiz: What is the probability that a random message that arrives falls into the spam bucket, assuming that those messages are all drawn at random?
And the answer is: there are 8 different messages, of which 3 are spam, so the maximum likelihood estimate is 3/8.
So, let's look at this a little bit more formally and talk about maximum likelihood. Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. And what we care about is: what prior probability of spam maximizes the likelihood of this data? So, let's assign a value pi to this prior, and we wish to find the pi that maximizes the likelihood of this data over here, assuming that each email is drawn independently according to an identical distribution. The probability p(yi) of the i-th data item is then pi if yi = spam, and 1 - pi if yi = ham. If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, we can write p(yi) compactly as pi^yi * (1 - pi)^(1 - yi). It's easy to see that this is equivalent: say yi = 1; then the second factor falls out, because its exponent is zero it's just a multiplication by 1, and we get pi, as over here. If yi = 0, the first factor falls out, and the second becomes 1 - pi, as over here. Now, assuming independence, the joint probability of the entire data set is the product of the probabilities of the individual data items, which can be written as pi raised to the count of instances where yi = 1, times (1 - pi) raised to the count of instances where yi = 0. In our example, the first count is 3 and the second is 5, so we get pi^3 * (1 - pi)^5. We now wish to find the pi that maximizes this expression. We can just as well maximize the logarithm of this expression, which is 3 log pi + 5 log(1 - pi); optimizing the log is the same as optimizing p, because the log is monotonic in p. The maximum of this function is attained where the derivative is 0, so let's compute the derivative and set it to 0: 3/pi - 5/(1 - pi) = 0. We bring the second term to the right side, multiply the denominators up, and sort all the terms containing pi to the left, which gives us pi = 3/8, exactly the number we arrived at before.
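You can also confirm the derivation numerically: a grid search over pi shows the log-likelihood 3 log pi + 5 log(1 - pi) peaking at 3/8. This is just a sanity check, not part of the lesson.

```python
import math

def log_likelihood(pi, n_spam=3, n_ham=5):
    # Log of pi^n_spam * (1 - pi)^n_ham for our 3 spam / 5 ham data set.
    return n_spam * math.log(pi) + n_ham * math.log(1 - pi)

# Search a fine grid of pi values in (0, 1) for the maximizer.
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best)  # 0.375, i.e. 3/8
```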
We just derived mathematically that the likelihood-maximizing number for this probability is indeed the empirical count, which means that when we looked at the earlier quiz and said the maximum likelihood estimate for the prior probability of spam is 3/8, simply by counting that 3 of 8 emails were spam, we actually followed proper mathematical principles of maximum likelihood estimation. Now, you might not have fully followed the derivation, and I recommend you watch it again, but it's not that important for your progress in this class. So, here's another quiz. I'd like the maximum likelihood, or ML, solutions for the following probabilities: the probability that the word "secret" comes up, given that we already know a message is spam, and the probability that the same word "secret" comes up given that the message is not spam, that is, ham.
And just as before, we count the word "secret" in SPAM and in HAM, as I've underlined here. Three out of the 9 words in SPAM are the word "secret," so we get 1/3, or 0.333, over here, and only 1 out of the 15 words in HAM is "secret," so you get 1/15, or 0.0667.
By now, you might have recognized what we're really building up: a Bayes network whose parameters are estimated using supervised learning, by a maximum likelihood estimator based on training data. The Bayes network has at its root an unobservable variable called spam, which is binary, and it has as many children as there are words in a message, where each word has an identical conditional distribution of the word occurrence given the class, spam or not spam. If you look at our dictionary over here, you might remember the dictionary had 12 different words, so here are 5 of the 12: offer, is, secret, click and sports. For the spam class, we found the probability of "secret" given spam to be 1/3, and we also found the probability of "secret" given ham to be 1/15. So here's a quiz: assuming a vocabulary size of 12, or, put differently, a dictionary of 12 words, how many parameters do we need to specify this Bayes network?
And the correct answer is 23. We need 1 parameter for the prior p(spam), and then we have 2 dictionary distributions: the probability of any word i given spam, and the same given ham. Now, there are 12 words in the dictionary, but each distribution needs only 11 parameters, since the 12th can be figured out from the constraint that they have to add up to 1. So if you add all of these together, 1 + 11 + 11, we get 23.
So, here's a quiz. Let's assume we fit all 23 parameters of the Bayes network, as explained, using maximum likelihood. Let's now do classification and see what class a message ends up in. Let me start with a very simple message containing the single word "sports," just to make it a little bit simpler. What's the probability that we classify this one-word message as spam?
And the answer is 3/18, or 1/6 ≈ 0.1667. How do I get there? Well, let's apply Bayes rule. The probability of spam given the message is easily transformed into this expression over here: the probability of the message given spam, times the prior probability of spam, over the normalizer, which is the total probability of the message. Now, we know that the word "sports" occurs 1 time in our 9 words of spam, and our prior probability for spam is 3/8, which gives us 1/9 times 3/8, or 3/72. We now have to add the corresponding product for the class ham: "sports" occurs 5 times out of the 15 words in the ham class, and the prior probability for ham is 5/8, which gives us 3/72 divided by 18/72, which is 3/18, or 1/6.
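The arithmetic, spelled out as a tiny sketch with the numbers from the counts above:

```python
p_spam, p_ham = 3 / 8, 5 / 8   # priors from the 8 messages
p_word_spam = 1 / 9            # "sports": 1 of the 9 spam words
p_word_ham = 5 / 15            # "sports": 5 of the 15 ham words

numerator = p_word_spam * p_spam             # 3/72
normalizer = numerator + p_word_ham * p_ham  # 3/72 + 15/72 = 18/72
print(numerator / normalizer)                # 0.1666... = 1/6
```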
This gets us to a more complicated quiz. Say the message now contains 3 words: "Secret is secret." Not a particularly meaningful email, but the frequent occurrence of "secret" seems to suggest it might be spam. What's the probability you're going to judge this to be spam?
And the answer is surprisingly high: 25/26, or 0.9615. To see this, we apply Bayes rule, which multiplies the prior for spam-ness by the conditional probability of each word given spam: "secret" contributes 1/3, "is" 1/9, and "secret" 1/3 again. We normalize this by the same expression plus the corresponding product for the non-spam case: the prior for ham is 5/8, "secret" is 1/15, "is" is 1/15, and "secret" 1/15 again. This resolves to 1/216 over the same 1/216 plus 1/5400, and when you work it all out, you get 25/26.
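The same computation generalizes to any message. Here is a sketch of an unsmoothed, maximum likelihood Naive Bayes classifier built from the 8 example messages; the names `ml_probs` and `posterior_spam` are mine, not from the lesson.

```python
from collections import Counter

spam = ["offer is secret", "click secret link", "secret sports link"]
ham = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]

def ml_probs(messages):
    # Maximum likelihood estimate: count of each word over total word count.
    counts = Counter(w for m in messages for w in m.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

p_spam, p_ham = 3 / 8, 5 / 8
probs_spam, probs_ham = ml_probs(spam), ml_probs(ham)

def posterior_spam(message):
    # Bayes rule: prior times per-word conditionals, then normalize.
    num, den = p_spam, p_ham
    for w in message.split():
        num *= probs_spam.get(w, 0.0)  # a word never seen in a class gets 0
        den *= probs_ham.get(w, 0.0)
    return num / (num + den)

print(round(posterior_spam("sports"), 4))            # 0.1667
print(round(posterior_spam("secret is secret"), 4))  # 0.9615
```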
The final quiz: let's assume our message is "Today is secret." And again, it might look like spam because the word "secret" occurs. I'd like you to compute for me the probability of spam given this message.
And surprisingly, the probability for this message to be spam is 0. It's not 0.001; it's flat 0. In other words, it's impossible, according to our model, that this text could be a spam message. Why is this? When we apply the same rule as before, we get the prior for spam, which is 3/8, and we multiply the conditional for each word into this. For "secret" we know it to be 1/3, for "is" 1/9, but for "today" it's 0. It's 0 because the maximum likelihood estimate for the probability of "today" given spam is 0; "today" just never occurred in a spam message so far. Now, this 0 is troublesome, because as we compute the outcome, plugging in all the numbers as before, none of the other words matter anymore; just the 0 matters, so we get 0 over something, which is plain 0. Are we overfitting? You bet. We are clearly overfitting. It can't be that a single word determines the entire outcome of our analysis. The problem is that our model is just too aggressive in assigning a probability of 0 to the word "today" in the class spam. Let's change this. One technique to deal with this overfitting problem is called Laplace smoothing. In maximum likelihood estimation, we estimate a probability as the quotient of the count of a specific event over the count of all events in our data set. For example, for the prior probability, we found that 3 of 8 messages are spam; therefore, our maximum likelihood estimate for the prior probability of spam was 3/8. In Laplace smoothing, we use a different estimate: we add the value k to the count, and we normalize as if we added k to every single class over which we're estimating, so the estimate becomes (count + k) over (N + k times the number of classes). This is equivalent to assuming we have a couple of fake training examples, where we add k to each observation count. Now, if k equals 0, we recover our maximum likelihood estimator. But if k is larger than 0 and N is finite, we get different answers.
Let's say k equals 1, and let's assume we get one message, and that message is spam, so we write it as: 1 message, 1 spam. What is p(spam) under Laplace smoothing with k = 1? Let's do the same with 10 messages, of which 6 are spam, and with 100 messages, of which 60 are spam. Please enter your numbers into the boxes over here.
The answer here is 2/3, or 0.667, and it's computed as follows. We have 1 message, of which 1 is spam, but we add k = 1 to the count, and we add k times 2 = 2 down here, because there are 2 different classes, which gives us 2/3. The answer over here is 7/12: again, we have 6/10, but we add 2 down here and 1 up here. And correspondingly, we get 61/102, which is (60 + 1) over (100 + 2). If we look at these last two numbers as decimals, we get 0.5833 and 0.5980. Interestingly, maximum likelihood on those 2 cases would give us 0.6, but we get values that are pulled towards 0.5, which is the effect of the smoothing prior in Laplacian smoothing.
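The Laplace estimate is a one-liner. A sketch reproducing the three quiz answers; the function name is mine.

```python
def laplace(count, total, k=1, num_classes=2):
    # Add k to the count, and k per class to the normalizer.
    return (count + k) / (total + k * num_classes)

print(laplace(1, 1))     # 2/3    ≈ 0.6667
print(laplace(6, 10))    # 7/12   ≈ 0.5833
print(laplace(60, 100))  # 61/102 ≈ 0.5980
```

With more data, the fake counts matter less, so the estimate moves back toward the maximum likelihood value of 0.6.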
Let's use the Laplacian smoother with k = 1 to calculate a few interesting probabilities: P(SPAM), P(HAM), and the probability of the word "today" given the SPAM class and given the HAM class. You may assume that our vocabulary size is 12 different words.
This one is easy to calculate for SPAM and HAM. For SPAM, it's 2/5, and the reason is: previously we had 3 spam out of 8 messages, but thanks to the Laplacian smoother, we add 1 up here, and since there are 2 classes, we add 2 times 1 down here, which gives us 4/10, which is 2/5. Similarly, we get 3/5 over here. Now the tricky part comes up over here. Before, we had 0 occurrences of the word "today" in the SPAM class, and we had 9 data points. But now we add 1 for the Laplacian smoother, and down here we add 12. The reason we add 12 is that there are 12 different words in our dictionary; hence, for each word in the dictionary, we add 1, for a total of 12, which gives us the 12 over here. That makes (0 + 1)/(9 + 12), which is 1/21. In the HAM class, we had 2 occurrences of the word "today," over here and over here. We add 1, and normalize by 15, plus 12 for the dictionary size, which is 3/27, or 1/9. This was not an easy question.
We come now to the final quiz here, in which I would like you to compute the probability that the message "today is secret" falls into the SPAM box, with the Laplacian smoother using k = 1. Please just enter your number over here. This is a non-trivial question; it might take you a while to calculate.
The approximate probability is 0.4858. How did we get this? Well, the prior probability for SPAM under Laplacian smoothing is 2/5. "Today" doesn't occur in spam, but we have already calculated its smoothed probability to be 1/21. "Is" occurs once, so we get 2 over 21. "Secret" occurs 3 times, so we get 4 over 21. We normalize this product by the same expression plus the one for HAM: the prior for HAM is 3/5; we have 2 occurrences of "today," plus 1, which equals 3/27; "is" occurs once, giving 2/27; and "secret" occurs once, again 2/27. When you work this all out, you get this number over here.
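Putting all of this together, here is a sketch of the full Laplace-smoothed Naive Bayes classifier for the example corpus; the function names are mine, not from the lesson. It reproduces the 0.4858 above.

```python
from collections import Counter

spam = ["offer is secret", "click secret link", "secret sports link"]
ham = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]
k = 1
vocabulary = set(w for m in spam + ham for w in m.split())  # 12 words

def smoothed_probs(messages):
    # Laplace estimate per word: (count + k) / (total words + k * |vocabulary|).
    counts = Counter(w for m in messages for w in m.split())
    total = sum(counts.values())
    return {w: (counts[w] + k) / (total + k * len(vocabulary)) for w in vocabulary}

n = len(spam) + len(ham)
p_spam = (len(spam) + k) / (n + 2 * k)  # (3 + 1) / (8 + 2) = 2/5
p_ham = (len(ham) + k) / (n + 2 * k)    # (5 + 1) / (8 + 2) = 3/5
probs_spam, probs_ham = smoothed_probs(spam), smoothed_probs(ham)

def posterior_spam(message):
    # Bayes rule with smoothed priors and word conditionals.
    num, den = p_spam, p_ham
    for w in message.split():
        num *= probs_spam[w]
        den *= probs_ham[w]
    return num / (num + den)

print(round(posterior_spam("today is secret"), 4))  # 0.4858
```

Notice that no word can zero out the posterior anymore: every word in the vocabulary has a nonzero smoothed probability in both classes.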
So we learned quite a bit. We learned about Naive Bayes as our first supervised learning method. The setup was that we had features of documents, our training examples, and labels, in this case SPAM or not SPAM. From those pieces, we made a generative model for the SPAM class and the non-SPAM class that described the conditional probability of each individual feature. We then used first maximum likelihood, and then a Laplacian smoother, to fit those parameters over here. And then, using Bayes rule, we could take any training example over here and figure out the class probability over here. This is called a generative model, in that the conditional probabilities all aim to maximize the probability of the individual features, as if those describe the physical world. We also used what is called a bag of words model, in which our representation of each email just counted the occurrences of words, irrespective of their order. Now, this is a very powerful method for fighting SPAM. Unfortunately, it's not powerful enough. It turns out spammers know about Naive Bayes, and they've long since learned to come up with messages that fool your SPAM filter if it uses Naive Bayes. So companies like Google and others have become much more sophisticated in their methods for SPAM filtering. Now, I can give you some more examples of how to filter SPAM, but all of those quite easily fit within the same Naive Bayes model.
So here are features that you might consider when you write an advanced spam filter. For example: does the email come from a known spamming IP or computer? Have you emailed this person before? In which case it is less likely to be spam. Here's a powerful one: have 1,000 other people recently received the same message? Is the email header consistent? For example, if the "from" field says your bank, is the IP address really your bank's? Surprisingly: is the email in all caps? Strangely, many spammers believe that if you write things in all caps, you'll pay more attention to them. Do the inline URLs point to the pages they say they're pointing to? Are you addressed by your correct name? These are some features; I'm sure you can think of more. You can toss them easily into the Naive Bayes model and get better classification. In fact, modern spam filters keep learning as people flag emails as spam, and of course spammers keep learning as well, trying to fool modern spam filters. Who's going to win? Well, so far the spam filters are clearly winning. Most of my spam I never see, but who knows what's going to happen in the future? It's a really fascinating machine learning problem.
Naive Bayes can also be applied to the problem of handwritten digit recognition. This is a sample of handwritten digits taken from a U.S. postal data set, where handwritten zip codes on letters are scanned and automatically classified. The machine learning problem here is: given a symbol just like this, what is the corresponding number? Here it's obviously 0. Here it's obviously 1. Here it's obviously 2. For the one down here, it's a little bit harder to tell. Now, when you apply Naive Bayes, the input vector could be the pixel values of each individual pixel, so for a 16 x 16 input resolution you would get 256 different values, corresponding to the brightness of each pixel. Now, obviously, given sufficiently many training examples, you might hope to recognize digits, but one of the deficiencies of this approach is that it's not particularly shift-invariant. For example, a pattern like this will look fundamentally different from a pattern like this, even though the pattern on the right is obtained by shifting the pattern on the left by 1 pixel to the right. There are many different solutions, but a common one is to use smoothing, in a different way from the way we discussed before. Instead of counting just 1 pixel's value, you could mix it with the counts of the neighboring pixel values, so that if the pattern is slightly shifted, we get about the same statistics as for the unshifted pattern. Such a method is called input smoothing. You can, as it's technically called, convolve the input image with a smoothing kernel, and you might get better results than if you ran Naive Bayes on the raw pixels. Now, to tell you the truth, for digit recognition of this type, Naive Bayes is not a good choice; the conditional independence assumption of each pixel, given the class, is too strong an assumption in this case. But it's fun to talk about image recognition in the context of Naive Bayes regardless.
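As a hypothetical illustration of input smoothing, here is a tiny 1-D sketch; the kernel weights and the pixel rows are made up, not from the lesson. Blurring each pixel with its neighbors makes a pattern and its one-pixel shift overlap, where the raw patterns do not overlap at all.

```python
def smooth(pixels, weight=0.25):
    # Mix each pixel with its neighbors: a 1-D convolution with the
    # kernel [weight, 1 - 2*weight, weight], zero-padded at the borders.
    out = []
    for i, p in enumerate(pixels):
        left = pixels[i - 1] if i > 0 else 0.0
        right = pixels[i + 1] if i + 1 < len(pixels) else 0.0
        out.append(weight * left + (1 - 2 * weight) * p + weight * right)
    return out

a = [0, 0, 1, 0, 0]
b = [0, 0, 0, 1, 0]  # the same pattern, shifted right by one pixel

raw_overlap = sum(min(x, y) for x, y in zip(a, b))                    # 0
smoothed_overlap = sum(min(x, y) for x, y in zip(smooth(a), smooth(b)))
print(raw_overlap, smoothed_overlap)  # 0 versus a positive overlap
```

The raw patterns share no mass, so a per-pixel model treats them as unrelated; after smoothing, their statistics look similar, which is the point of the technique.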
So, let me step back a step and talk a bit about overfitting prevention in machine learning, because it's such an important topic. We talked about Occam's Razor, which in a generalized way suggests there is a tradeoff between how well we can fit the data and how smooth our learned function is. In our section on smoothing, we already found one way to let Occam's Razor play, which is by selecting the value K to make our statistical counts smoother. I alluded to a similar way in the image recognition domain, where we smoothed the image so that neighboring pixels get similar counts. This all raises the question of how to choose the smoothing parameter--in particular, in Laplacian smoothing, how to choose the K. There is a method called cross-validation which can help you find an answer. This method assumes there are plenty of training examples, but to tell you the truth, in spam filtering there are more than you'd ever want. Take your training data and divide it into 3 buckets: train, cross-validate, and test. Typical ratios are 80% into train, 10% into cross-validation, and 10% into test. You use the train set to find all your parameters--for example, the probabilities of a Bayes network. You use your cross-validation set to find the optimal K, and the way you do this is you train for different values of K, you observe how well the trained model performs on the cross-validation data, not touching the test data, and then you pick the K that gives the best performance on the cross-validation set. You iterate this many times until you find the best K. When you're done with the best K, you train again, and then finally, only once, you touch the test data to verify the performance, and this is the performance you report. It's really important in cross-validation to split apart a cross-validation set that's different from the test set.
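The recipe can be sketched in a few lines; the toy counts, the candidate K values, and the squared-error score below are stand-ins of mine, not the lecture's actual spam data.

```python
# A sketch of picking the Laplacian smoothing parameter k on a
# cross-validation set. The toy counts and CV labels are made up.
def smoothed_prob(count, total, k, num_values=2):
    # Laplace-smoothed estimate of P(spam)
    return (count + k) / (total + k * num_values)

def cv_error(k, train_counts, cv_data):
    """Mean squared error of the smoothed estimate on held-out labels."""
    count, total = train_counts
    p = smoothed_prob(count, total, k)
    return sum((y - p) ** 2 for y in cv_data) / len(cv_data)

train_counts = (9, 10)               # e.g. 9 of 10 training mails were spam
cv_data = [1, 1, 1, 0, 1, 1, 0, 1]   # held-out labels (75% spam)
best_k = min([0.1, 1, 10, 100], key=lambda k: cv_error(k, train_counts, cv_data))
print(best_k)
```

The test set plays no role in this loop; it is touched exactly once at the very end, as described above.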
If you were to use the test set to find the optimal K, then your test set would become an effective part of your training routine, and you might overfit to your test data without even knowing. By keeping the test data separate from the beginning, you train on the training data, you use the cross-validation data to find out how well your trained model is doing and to fine-tune the unknown parameter K, and finally, only once, you use the test data to get a fair answer to the question, "How well will your model perform on future data?" So, pretty much everybody in machine learning uses this method. You can redo the split between the training and the cross-validation part; people often use the term 10-fold cross-validation, where they do 10 different foldings and run the model 10 times to find the optimal K or smoothing parameter. No matter which way you do it, find the optimal smoothing parameter and then use the test set exactly once to verify and report.
Let me back up a step further and look at supervised learning more generally. Our example so far was one of classification. The characteristic of classification is that the target labels, or the target class, is discrete. In our case it was actually binary. In many problems, we try to predict a continuous quantity--for example, a value in the interval 0 to 1, or perhaps an arbitrary real number. Those machine learning problems are called regression problems. Regression problems are fundamentally different from classification problems. For example, our Bayes network doesn't afford us an answer to a problem where the target value could lie anywhere between 0 and 1. A regression problem, for example, would be to predict the weather tomorrow. Temperature is a continuous value; our Bayes network would not be able to predict the temperature--it can only predict discrete classes. A regression algorithm is able to give us a continuous prediction of the temperature tomorrow. So let's look at regression next. Here's my first quiz for you on regression. This scatter plot shows, for Berkeley, California, for a period of time, the data for each house that was sold. Each dot is a sold house. It graphs the size of the house in square feet against the sales price in thousands of dollars. As you can see, roughly speaking, as the size of the house goes up, so does the sales price. I wonder, for a house of about 2500 square feet, what is the approximate sales price you would assume, based just on the scatter plot data? Is it 400K, 600K, 800K, or 1000K?
My answer is, there seems to be a roughly linear relationship, maybe not quite linear, between the house size and the price. So if we look at the linear graph that best describes the data, you get this dashed line over here. And for the dashed line, if you walk up at 2500 square feet, you end up with roughly 800K. So this would have been the best answer.
Now obviously you can answer this question without understanding anything about regression. But what you find is that this is different from classification as before. This is not a binary concept anymore, like expensive versus cheap. It really is a relationship between two variables: one you care about--the house price--and one that you can observe, which is the house size in square feet. And your goal is to fit a curve that best explains the data. Once again, we have a case where we can play Occam's razor. There clearly is a data fit that is not linear that might be better, like this one over here. And when you go beyond linear curves, you might even be inclined to draw a curve like this. Now of course the curve I'm drawing right now is likely an overfit, and you don't want to postulate that this is the general relationship between the size of a house and its sales price. So even though my black curve might describe the data better, the blue curve or the dashed linear curve over here might be a better explanation, by virtue of Occam's razor. So let's look a little bit deeper into what we call regression. As in all regression problems, our data will be comprised of input vectors of length n that map to another continuous value, and we might be given a total of M data points. This is just like the classification case, except this time the Ys are continuous. Once again, we're looking for a function f that maps our vector x into y. In linear regression, the function has a particular form, which is W1 times X plus W0. In this case X is one-dimensional, that is, N = 1. Or in a higher-dimensional space, we might just write W times X plus W0, where W is a vector and X is a vector, and this is the inner product of these 2 vectors over here. Let's for now just consider the one-dimensional case. In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0, and I've given you a data set.
And this data set happens to be fittable by a linear regression model without any residual error. Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are?
This is a surprisingly challenging question. If you look at these numbers from 3 to 6: when we increase X by 3, Y decreases by 3, which suggests W1 is -1. Now let's see if this holds. If we increase X by 3, it decreases Y by 3. If we increase X by 1, we decrease Y by 1. If we increase X by 2, we decrease Y by 2. So this number seems to be an exact fit. Next we have to get the constant W0 right. For X = 3, we get -3 for the first term over here, because we know W1 = -1. So if the prediction has to come out to zero, then W0 has to be 3. Let's do a quick check: -3 plus 3 is 0; -6 plus 3 is -3. And if we plug in any of the other numbers, you find those are correct. Now this is the case of an exact data set. It gets much more challenging if the data set cannot be fit with a linear function.
To define linear regression, we need to understand what we are trying to minimize. The quantity is called a loss function, and the loss function is the amount of residual error we obtain after fitting the linear function as well as possible. The residual error is the sum over all training examples J of YJ, which is the target label, minus our prediction, W1 XJ plus W0, all squared. This is the quadratic error between our target labels and what our best hypothesis can produce. Minimizing this loss is how we solve a linear regression problem, and you can write it as follows: our solution to the regression problem, W, is the arg min of the loss over all possible vectors W.
The problem of minimizing quadratic loss for linear functions can be solved in closed form. When I derive it, I will do this for the one-dimensional case on paper. I will also give you the solution for the case where your input space is multidimensional, which is often called "multivariate regression." We seek to minimize a sum of quadratic expressions, where the target labels are subtracted from the output of our linear regression model parameterized by w1 and w0. The summation here is over all training examples, and I leave the index off the summation where it isn't necessary. The minimum of this is obtained where the derivative of this function equals zero. Let's call this function "L." For the partial derivative with respect to w0, we get this expression over here, which we have to set to zero. We can easily get rid of the -2 and transform this as follows, where M is the number of training examples. This expression over here gives us w0 as a function of w1, but we don't know w1. Let's do the same trick for w1 and set its derivative to zero as well, which gets us the expression over here. We can now plug the w0 over here into this expression over here and obtain this expression over here, which looks really involved but is relatively straightforward. With a few steps of further calculation, which I'll spare you for now, we get for w1 the following important formula: w1 is a quotient, where we take the number of training examples times the sum of all products xy, minus the sum of all x times the sum of all y, divided by the number of training examples times the sum of all x squared, minus the square of the sum of all x. Once we've computed w1, we can go back to our original expression for w0 over here, plug w1 in, and obtain w0. These are the two important formulas you can also find in the textbook. I'd now like to go back and use those formulas to calculate these two coefficients over here.
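The two closed-form formulas just derived can be sketched directly in code; the quiz data set is assumed below to be the points satisfying y = -x + 3 (consistent with the sums used in the next calculation).

```python
# The closed-form solution for one-dimensional linear regression:
#   w1 = (M*sum(xy) - sum(x)*sum(y)) / (M*sum(x^2) - sum(x)^2)
#   w0 = (sum(y) - w1*sum(x)) / M
def linear_regression(xs, ys):
    M = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (M * sxy - sx * sy) / (M * sxx - sx * sx)
    w0 = (sy - w1 * sx) / M
    return w1, w0

# The exact-fit quiz data from before, which lies on y = -x + 3:
print(linear_regression([3, 4, 5, 6], [0, -1, -2, -3]))  # (-1.0, 3.0)
```

With an exact fit, the residual error is zero; on noisy data the same formulas return the line minimizing the quadratic loss.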
You get 4 times the sum of all products xy, which is -32, minus the product of the sum of all x, which is 18, and the sum of all y, which is -6, divided by 4 times the sum of all x squared, which is 86, minus the square of the sum of all x, which is 18 times 18, or 324. If you work this all out, it comes to -1, which is w1. W0 is now obtained by computing a quarter of the sum of all y, which is -6, minus w1, which is -1, times a quarter of the sum of all x. If you plug this all in, you get 3, as over here. So our formula is actually correct. Here is another quiz for linear regression. We have the following data, and here is the data plotted graphically. I wonder what the best regression is. Give me w0 and w1--apply the formulas I just gave you.
And the answer is W0 = 0.5 and W1 = 0.9. If I were to draw a line, it would go about like this. It doesn't really hit the two points at the end. If you were thinking of something like this, you were wrong. If you draw a curve like this, your quadratic error becomes 2: one over here, and one over here. The quadratic error is smaller for the line that goes in between those points. This is easily seen by computing as shown on the previous slide: W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400), which is 0.9. This is merely plugging those numbers into the formulas I gave you. W0 then becomes ¼ x (20 - 0.9 x 20), which equals 0.5. This is an example of linear regression with a residual error, where the best-fitting curve is the one that minimizes the total of the residual vertical errors in this graph over here.
So linear regression works well if the data is approximately linear, but there are many examples where linear regression performs poorly. Here's one where we have a curve that is really nonlinear. This is an interesting one, where we seem to have a linear relationship that is flatter than the linear regression indicates, but there is one outlier. Because you are minimizing quadratic error, outliers penalize you over-proportionately, so outliers are particularly bad for linear regression. And here is a case where the data clearly suggests a very different phenomenon from a linear one: the data over here has a strong frequency and a strong vertical spread. Clearly a linear regression model is a very poor one to explain this data. Another problem with linear regression is that, as you go to infinity in the X space, your Ys also become infinite. In some problems that isn't a plausible model. For example, if you wish to predict the weather anytime into the future, it's implausible to assume that the further the prediction goes out, the hotter or the cooler it becomes. For such situations there is a model called logistic regression, which uses a slightly more complicated model than linear regression, and it goes as follows. Let F of X be our linear function, and the output Z of logistic regression is obtained by the following function: one over one plus the exponential of minus F of X. So here's a quick quiz for you. What is the range in which Z might fall, given this function over here, where F of X is the linear function over here? Is it zero to one? Minus one to one? Minus one to zero? Minus two to two? Or none of the above?
The answer is zero to one. If this expression over here, F of X, grows to positive infinity, then Z becomes one, and the reason is, as this term over here becomes very large, E to the minus of that term approaches zero, and one over one equals one. If F of X goes to minus infinity, then Z goes to zero, and the reason is, if this expression over here goes to minus infinity, E to the minus of it becomes very large, and one over something very large approaches zero. When we plot the logistic function it looks like this: it's approximately linear around F of X equals zero, but it levels off to zero and one as we go to the extremes.
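We can check this behavior numerically; the particular linear function f(x) = 2x - 1 below is an arbitrary choice of mine for illustration.

```python
# The logistic function z = 1 / (1 + e^(-f(x))) applied to a linear f.
import math

def logistic(f_of_x):
    return 1.0 / (1.0 + math.exp(-f_of_x))

f = lambda x: 2 * x - 1   # an arbitrary linear function for illustration
for x in (-10, 0, 0.5, 1, 10):
    print(x, round(logistic(f(x)), 6))
# As f(x) -> -infinity, z -> 0; at f(x) = 0, z = 0.5;
# as f(x) -> +infinity, z -> 1.
```

Note the levelling off: at x = 0.5 (where f(x) = 0) the output is exactly 0.5, and at the extremes it saturates toward 0 and 1.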
Another problem with linear regression has to do with regularization, or complexity control. Just like before, we sometimes wish to have a less complex model. So in regularization, the loss function is the sum of the loss over the data and a complexity control term, which is often called the loss over the parameters. The loss over the data is simply quadratic loss, as we discussed before. The loss over the parameters might just be a function that penalizes the parameters for becoming large--their norm raised to some power P, where P is usually either 1 or 2. If you draw this graphically, in a parameter space comprised of 2 parameters, your quadratic term for minimizing the data error might look like this, where the minimum sits over here. Your term for regularization pulls these parameters toward 0: it pulls them toward 0 along circles if you use quadratic (L2) error, and in a diamond-shaped way for L1 regularization. Either one works well. L1 has the advantage that parameters tend to get really sparse. If you look at this diagram, there is a tradeoff between W0 and W1; in the L1 case, that allows one of them to be driven to 0. In the L2 case, parameters tend not to be as sparse, so L1 is often preferred.
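This sparsity effect can be seen in a toy experiment; the quadratic data loss, the grid search, and the regularization weight below are all made-up stand-ins of mine to illustrate the geometry, not a practical fitting procedure.

```python
# A toy illustration of L1 vs. L2 regularization: minimize
# data_loss(w) + penalty(w) over a coarse grid of (w0, w1) values.
def data_loss(w0, w1):
    # quadratic bowl with its unregularized minimum at w0=1.0, w1=0.2
    return (w0 - 1.0) ** 2 + (w1 - 0.2) ** 2

def minimize(penalty, step=0.05):
    grid = [i * step for i in range(-40, 41)]
    return min(((w0, w1) for w0 in grid for w1 in grid),
               key=lambda w: data_loss(*w) + penalty(w))

l1 = minimize(lambda w: abs(w[0]) + abs(w[1]))       # diamond-shaped pull
l2 = minimize(lambda w: w[0] ** 2 + w[1] ** 2)       # circular pull
print("L1:", l1, " L2:", l2)
# L1 drives the small weight w1 exactly to 0; L2 only shrinks it.
```

The small weight ends up exactly 0 under L1 but merely shrunk under L2, which is the sparsity argument made above.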
This all raises the question of how to minimize more complicated loss functions than the ones we discussed so far. Are there closed-form solutions of the type we found for linear regression, or do we have to resort to iterative methods? The general answer is, unfortunately, we have to resort to iterative methods. Even though there are special cases in which closed-form solutions may exist, in general our loss functions now become complicated enough that all we can do is iterate. Here is a prototypical loss function, and the method for iteration is called gradient descent. In gradient descent, you start with an initial guess, W-0, where 0 is your iteration number, and then you update it iteratively. Your (i+1)-st parameter guess is obtained by taking your i-th guess and subtracting from it the gradient of your loss function at that guess, multiplied by a small learning rate alpha, where alpha is often as small as 0.01. I have a couple of questions for you. Consider the following 3 points; we call them A, B, and C. I wish to know, for points A, B, and C, is the gradient at each point positive, about zero, or negative? For each of those, check exactly one of the cases.
In case A, the gradient is negative: if you move to the right in the X space, your loss decreases. In B, it's about zero. In C, it's pointing up; it's positive. So if you apply the update rule over here, if you were to start at A as your W-zero, then your gradient is negative; therefore, you would add something to the value of W. You move to the right, and your loss decreases. You do this until you find yourself in what's called a local minimum, where B resides. In this instance over here, gradient descent starting at A would not get you to the global minimum, which sits over here, because there's a bump in between. Gradient methods are known to be subject to local minima.
I have another gradient quiz. Consider the following quadratic error function. We are considering the gradient in 3 different places: a, b, and c. And I ask you which gradient is the largest: a, b, or c, or are they all equal? In which case, you would want to check the last box over here.
And the answer is C. The derivative of a quadratic function is a linear function, which would look about like this. And as we go outward, our gradient becomes larger and larger. The curve over here is much steeper than the curve over here.
[Thrun] Here is a final gradient descent quiz. Suppose we have a loss function like this and our gradient descent starts over here. Will it likely reach the global minimum? Yes or no. Please check one of those boxes.
[Thrun] And the answer is yes, although, technically speaking, to reach the absolute global minimum we need the learning rate to become smaller and smaller over time. If it stays constant, there is a chance this thing might bounce around between 2 points in the end and never reach the global minimum. But assuming that we implement gradient descent correctly, we will finally reach the global minimum. That's not the case if you start over here, where we can get stuck and settle for the minimum over here, which is a local minimum and not the best solution to our optimization problem. So one of the important points to take away from this is that gradient descent is universally applicable to more complicated problems--problems that don't have a closed-form solution. But you have to check whether there are many local minima, and if so, you have to worry about them. Any optimization book can tell you tricks for overcoming this; I won't go into any more depth here in this class.
[Thrun] It's interesting to see how to minimize a loss function using gradient descent. In our linear case, we have L equals the sum over all J of the correct labels minus our linear function, squared, which we seek to minimize. We already know that this has a closed-form solution, but just for the fun of it, let's look at gradient descent. The gradient of L with respect to W1 is minus 2 times the sum over all J of the difference as before, but without the square, times Xj. The gradient with respect to W0 is very similar. So in gradient descent we start with W1-0 and W0-0, where the superscript 0 corresponds to the iteration index of gradient descent, and then we iterate. In the mth iteration we get our new estimate by taking the old estimate minus a learning rate times this gradient over here, evaluated at the old estimate W1, m minus 1. Similarly, for W0 we get this expression over here. These expressions look nasty, but what they really mean is we subtract an expression like this from W1 and an expression like this from W0 every time we do gradient descent, which is easy to implement, and that implements gradient descent.
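The update just described can be sketched as a short loop; the learning rate and iteration count below are arbitrary choices of mine for illustration.

```python
# Batch gradient descent for the one-dimensional linear regression loss,
# using the gradients just derived:
#   dL/dw1 = -2 * sum((y - w1*x - w0) * x),  dL/dw0 = -2 * sum(y - w1*x - w0)
def gradient_descent(xs, ys, alpha=0.005, iters=10000):
    w1, w0 = 0.0, 0.0                      # initial guess
    for _ in range(iters):
        err = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        g1 = -2 * sum(e * x for e, x in zip(err, xs))   # dL/dw1
        g0 = -2 * sum(err)                              # dL/dw0
        w1 -= alpha * g1                   # step against the gradient
        w0 -= alpha * g0
    return w1, w0

# On the exact-fit data from before (y = -x + 3), this converges to the
# closed-form answer w1 = -1, w0 = 3:
w1, w0 = gradient_descent([3, 4, 5, 6], [0, -1, -2, -3])
print(round(w1, 3), round(w0, 3))  # -1.0 3.0
```

If alpha is too large the iteration diverges, and if it is too small it converges slowly, which is why the learning rate matters.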
Now, there are many different ways to apply linear functions in machine learning. So far we have studied linear functions for regression, but linear functions are also used for classification, specifically in an algorithm called the perceptron algorithm. This algorithm happens to be a very early model of a neuron, as in the neurons we have in our brains, and was invented in the 1940s. Suppose we are given a data set of positive samples and negative samples. A linear separator is a linear equation that separates positive from negative examples. Obviously, not all data sets possess a linear separator, but some do, and for those we can define the perceptron algorithm, and it actually converges. To define a linear separator, let's start with our linear equation as before--w1x + w0; in cases where x is higher dimensional, this might actually be a vector--never mind. If this is larger than or equal to zero, then we call our classification 1; otherwise, we call it 0. Here's our linear separation classification function, where this is our common linear function. Now, as I said, the perceptron only converges if the data is linearly separable, and then it converges to a linear separation of the data, which is quite amazing. The perceptron is an iterative algorithm that is not dissimilar from gradient descent. In fact, the update rule echoes that of gradient descent, and here's how it goes. We start with a random guess for w1 and w0, which may correspond to a random separation line but usually is inaccurate. Then the weight in the mth iteration is obtained by using the old weight plus some learning rate alpha times the difference between the desired target label and the target label produced by our function at iteration m-1. Now, this is an online learning rule, which means we don't process all the data in batch. We process one data point at a time, and we might go through the data many, many times--hence the j over here--but every time we do this, we apply this rule over here.
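For one-dimensional inputs, the online rule can be sketched as follows; the toy data set (positive exactly when x is at least 3), the learning rate, and the zero initial guess are my own illustrative choices.

```python
# A sketch of the perceptron update rule: weights move in proportion to
# the error, one example at a time.
def f(x, w1, w0):
    # linear separation classification function
    return 1 if w1 * x + w0 >= 0 else 0

def perceptron(data, alpha=0.5, passes=20):
    w1, w0 = 0.0, 0.0                     # initial (inaccurate) guess
    for _ in range(passes):               # many passes over the data
        for x, y in data:                 # one example at a time (online)
            err = y - f(x, w1, w0)        # zero error -> no update
            w1 += alpha * err * x
            w0 += alpha * err
    return w1, w0

data = [(1, 0), (2, 0), (3, 1), (4, 1)]   # linearly separable toy data
w1, w0 = perceptron(data)
print(all(f(x, w1, w0) == y for x, y in data))  # True: data is separated
```

Because this data is linearly separable, the updates eventually stop, which is the convergence property mentioned above.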
What this rule gives us is a method to adapt our weights in proportion to the error. If the prediction of our function f equals our target label, the error is zero, and no update occurs. If there is a difference, however, we update in a way so as to minimize the error. Alpha is a small learning rate. Once again, the perceptron converges to a correct linear separator if such a linear separator exists. Now, the case of linear separation has recently received a lot of attention in machine learning. If you look at the picture over here, you'll find there are many different linear separators: there is one over here, one over here, and one over here. One of the questions that has recently been researched extensively is which one to prefer. Is it a, b, or c? Even though you have probably never seen this literature, I will just ask your intuition in the following quiz: which linear separator would you prefer, if you look at these three different linear separators--a, b, c, or none of them?
As the final method in this lesson, I'd now like to talk about k-nearest neighbors. The distinguishing factor of k-nearest neighbors is that it is a non-parametric machine learning method. So far we've talked about parametric methods. Parametric methods have parameters, like probabilities or weights, and the number of parameters is constant. Or to put it differently, the number of parameters is independent of the training set size. So for example, in Naive Bayes, if we bring in more data, the number of conditional probabilities will stay the same. Well, that wasn't technically always the case--our vocabulary might increase, and with it the number of parameters--but for any fixed dictionary, the number of parameters is truly independent of the training set size. The same was true, for example, in our regression case, where the number of regression weights is independent of the number of data points. Now this is very different from methods where the number of parameters can grow--in fact, it can grow a lot--over time. Those techniques are called non-parametric. Nearest neighbor is so straightforward, I'd really like to introduce it using a quiz. So here's my quiz. Suppose we have a number of data points. I want you, for 1-nearest neighbor, to check those squared areas that you believe will carry a positive label, and I will give you the labels of the existing data points. So please check any of those boxes that you believe are positive. The algorithm, of course, searches for the nearest point in this Euclidean space and just copies its label.
And the answer was: this is a positive point, and this is a positive point. These 2 points over here are negative. So let's define k-nearest neighbors. The algorithm is really blatantly simple. In the learning step, you simply memorize all the data. If a new example comes along whose input value you know but which you wish to classify, you do the following: you first find the k nearest neighbors, and then you return the majority class label among those as the class label for the new example. Simple, isn't it? So here's a somewhat contrived situation with a data point we wish to classify, where the labeled data lies on a spiral of increasing diameter as it goes outward. Please answer for me in this quiz what class label you'd assign for k = 1, k = 3, 5, 7, and all the way to 9.
And this is an easy answer. The nearest neighbor is this point over here, so for k = 1 we say plus. For 3 neighbors, we get 2 positive, 1 negative--it's still plus. For 5 neighbors--1, 2, 3, 4, 5--we get 3 negative, 2 positive--it's a minus. For 7, we get 3 positive but still 4 negative, so it's negative. And for 9, the positives outweigh the negatives, so we get a plus. Obviously, as you can see, as k increases, more and more data points are being consulted. So when k finally becomes 9, all those data points are in and give a much smoother result.
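The memorize-then-vote procedure defined above can be sketched in a few lines; the 2-D sample points below are made up for illustration, not the spiral from the quiz.

```python
# A minimal k-nearest-neighbor classifier: memorize the data, then
# return the majority label among the k closest points.
from collections import Counter

def knn(data, query, k):
    """data: list of ((x, y), label); query: (x, y)."""
    by_dist = sorted(data, key=lambda p: (p[0][0] - query[0]) ** 2 +
                                         (p[0][1] - query[1]) ** 2)
    labels = [label for _, label in by_dist[:k]]      # k nearest neighbors
    return Counter(labels).most_common(1)[0][0]       # majority vote

data = [((0, 0), '+'), ((1, 0), '+'), ((0, 1), '+'),
        ((5, 5), '-'), ((6, 5), '-'), ((5, 6), '-'), ((6, 6), '-')]
print(knn(data, (1, 1), 1), knn(data, (4, 4), 3))  # + -
```

Note there is no training beyond storing the data; all the work happens at query time.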
Just as in the Laplacian smoothing example before, the value of k is a smoothing parameter: it makes the function less scattered. Here is an example of k = 1 for a 2-class nearest neighbor problem. You can see the separation boundary is what's called a Voronoi diagram between the positive and negative classes, and in cases where there is noise between the class boundaries, you'll find really funny, complex boundaries, as indicated over here. Particularly interesting is this guy over here, where the class of this circle protrudes way into the otherwise solid class. Now, as you go to k = 3, you get this graph over here, which is smoother. So if you are over here, your two nearest neighbors are of this type over there, and you get a uniform class over here. In this region over here, you get uniform, solid classes as shown over here. The more you drive up k, the cleaner this decision boundary becomes, but the more outliers are actually misclassified as well. So if I go back to my k-nearest neighbor method, we just learned that k is a regularizer. It controls the complexity of the k-nearest neighbor algorithm, and the larger k is, the smoother the output. We can, once again, use cross-validation to find the optimal k, because there is an inherent tradeoff between the complexity of what we want to fit and the goodness of the fit.
What are the problems of kNN? Well, I would argue that there are two: very large data sets, and very large feature spaces. The first one results in lengthy searches when you try to find the k nearest neighbors. Fortunately, there are methods to search efficiently. Often you represent your data not by a linear list, in which case the search would be linear in the number of data points, but by a tree, where the search becomes logarithmic. The method of choice is called k-d trees, though there are many other ways to represent data points as trees. Now, very large feature spaces cause more of a problem. It turns out that computing nearest neighbors, as the dimension of the input vector increases, becomes increasingly difficult, and the tree methods become increasingly brittle. The reason is shown in the following graph: if you graph the input dimension against the average edge length of your neighborhood, you'll find that for randomly chosen points, very quickly all points are really far away. An edge length of one is obtained if your query point is less than one away from its nearest neighbor; if you have one hundred dimensions, it's almost certain to be exceeded. Why is that? Well, in one hundred dimensions, there tends to be at least one dimension in which, just by chance, you're far away. The number of points you need to get something close grows exponentially with the number of dimensions. So, for any fixed data set size, you will find yourself in a situation where all your neighbors are far away. Nearest neighbor works really well for small input spaces, like three or four dimensions. It works very poorly if your input space is twenty, twenty-five, or maybe one hundred dimensions. So don't trust nearest neighbor to do a good job if the dimension of your input space is high.
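This curse-of-dimensionality effect is easy to reproduce; the point counts and dimensions below are arbitrary choices of mine for a small experiment, not the lecture's exact graph.

```python
# For randomly drawn points in the unit hypercube, the distance to the
# nearest neighbor grows quickly with the input dimension.
import random

def avg_nn_distance(dim, n_points=200, n_queries=20, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        # Euclidean distance to the nearest stored point
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                     for p in pts)
    return total / n_queries

for dim in (2, 10, 100):
    print(dim, round(avg_nn_distance(dim), 2))
# Even with 200 stored points, nearest neighbors in 100 dimensions
# are far away.
```

In two dimensions the nearest neighbor is very close; by one hundred dimensions even the nearest stored point is far from the query, which is exactly why kNN degrades in high-dimensional input spaces.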
So congratulations. You've just learned a lot about machine learning. We focused on supervised machine learning, which deals with situations where you have input vectors and given output labels, and your goal is to predict the output label from an input vector. We looked into parametric models like Naive Bayes and non-parametric models like k-nearest neighbors. We talked about classification, where the output is discrete, versus regression, where the output is continuous, and we looked at sample techniques for each of these situations. Now obviously we just scratched the surface of machine learning. There are books written about it and entire courses taught about it. Machine learning is a super fascinating topic, and it's the one within artificial intelligence that I love the most. What's really great is that as the real world produces more data--like the world wide web, medical data sets, or financial data sets--machine learning is poised to become more and more important. I hope that the things you learned in this class so far really excite you and entice you to apply machine learning to problems that you face in your professional life.