These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!
So welcome back to Statistics 101. Today, we're going to learn something insanely cool. You might remember from the very beginning of this class that we had data sets that involve more than one dimension. For example, the size of a house is related to its price. Initially, we talked about scatter plots and we talked about bar charts, but we didn't talk about what's the holy grail of statistics, and that is fitting a line through the data points. After this unit, you'll be able to fit a line to data points like those, and you'll even be able to tell me what the residual error is in that fit. That allows us not just to understand the data but also to make predictions about points we've never seen before.
So let's dive straight in. Let's talk about lines, and let's talk about the technique for fitting lines to data, called linear regression. Say we have two-dimensional data, such as the age of a person and the person's income, and this is obviously made-up data. Then linear regression tries to fit a line that best describes the data. So how do we specify a line? Suppose we call the horizontal axis x and the vertical axis y. Then a line is commonly described by a functional relationship between x and y of the following form: y=bx+a.
So let us look at lines. Let's take this simple example, y=2x. Let's add units to the axes of the coordinate system, and let me draw three different examples of lines: the blue line, the green line, and the red line. For this example, y=2x, can you work out which of the three lines is described by this equation? Please check one of these three boxes.
And let's look at this: if x=0, then y=0, so the line has to go through the origin. That takes out the red hypothesis over here. Now if x=2, then y is 4. That means the line goes through the point (2, 4) and, of course, (1, 2). That makes the green line the correct choice here.
And just for kicks, let's try this again with a different equation: b is now -1 and a is 4, so y=-x+4, which describes a different line. I'm going to give a couple of choices. Here's one, here's another one, and here's the third. Which one best reflects this equation over here?
And I hope you got this right. When x=0, then y should be 4, which is the point over here. If x=4, then -4+4 cancels out, so y=0, which is the point over here. So red would have been the correct answer.
A final quiz: now you get to determine what a and b are for the blue line, which at x=4 assumes the value y=2 and at x=0 assumes the value y=1. What's a and what's b for the blue line?
Well, you always start with a, because you can get a by looking at the case of x equals 0, where b doesn't matter. If x equals 0, which means you are right over here, then y equals a. That follows directly from this equation, because the term bx drops out when x equals 0. So what's the y value over here? It's 1, so that means a has to be 1. Now that we know this, we can look at b. We know that if x=4, then the expression on the right over here gets us a 2 in the y dimension. Since we already know a, we can plug it in, which gives us 4b+1=2. We bring the 1 to the right side, so 4b=1, and then we divide by 4, which means b equals one fourth, or a quarter. So 0.25 and 1 are the correct answers here. So now you understand lines. Let's talk about linear regression.
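The same two-step reasoning can be sketched in a few lines of Python. This is only an illustration: the function name line_through is made up for this example, and it assumes the two points have different x values.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = b*x + a through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # A vertical line cannot be written in the form y = b*x + a.
        raise ValueError("the two points must have different x values")
    b = (y2 - y1) / (x2 - x1)  # slope: change in y per unit change in x
    a = y1 - b * x1            # intercept: the value of y at x = 0
    return a, b


# The blue line from the quiz passes through (0, 1) and (4, 2).
a, b = line_through((0, 1), (4, 2))
print(a, b)  # 1.0 0.25
```

Unlike the quiz, this sketch computes the slope b first and then a, so it also works when neither point sits at x=0.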
In linear regression, we are given data, and the data still has more than one dimension. We have studied just two dimensions throughout this class. So this might be the data, and we're looking for the best line that fits the data or, put differently, the parameters a and b that we discussed before that define the line. The word best is interesting. Obviously, it's impossible to put a line exactly through the data points over here. This is what's called nonlinear data; it goes up and down. Data often looks like this even if the relationship between x and y is linear. That's usually what's called noise in the data, some amount of randomness you can't explain.
So in finding the best fit, we're trying to find a line that minimizes the difference between the data and the line in the y direction, and that's somewhat counterintuitive. You'd normally think the distance is given by these lines over here in red, the shortest distance irrespective of x and y. But in linear regression, it turns out that we're minimizing the distance just in the y direction. And there's a lot of theory why this is a good idea. In summary, we assume that our data is the result of some unknown linear function plus noise, y=bx+a+noise, and if this noise is assumed to be Gaussian, then minimizing the quadratic deviation between the data points and the line happens to be the correct mathematical answer. But leaving this aside, what we're doing is adding up, over all data points, the squared difference between our function and the y value of the data point, and that sum is the distance we're minimizing.
So let's look into this with examples. For the following four data points, which of these lines has the smallest quadratic error and would be the best regression line? Pick one of the three.
In this case, it's easy. It's this one, because it seems to describe the data almost perfectly, whereas the red one has a substantial error, and the green curve even more so.
Let's look at more data. Which line would you pick? The one in the middle, the red one, the blue one, or perhaps any of those? Perhaps they are all equally good. It turns out green is the right answer. Let's just look at this. The blue one suffers no loss for the three points over here but a fairly substantial loss for the other three data points. Let's call this loss c. So the blue curve would have an error of 3c², if this distance here is called c. The factor 3 is there because we have three of those points, and c is squared because we're using the quadratic distance. Similarly, the red one has the same problem: 3c² for the red one. Now, how about the green one? For the green one, we have errors for all six points, but now the amount of the error itself is half as big as c. So we're going to write 6(c/2)², and when you work this out, you'll find that this is 6/4 c², which is the same as 3/2 c². So the total error for the green one, the quadratic error, is half as big as it is for the blue one, and that's a surprise. It has to do with the fact that in a quadratic version of the error, large errors count much, much more than small errors. So green is the best regression line we can find for this data over here among the choices I've given you.
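To check the arithmetic, here is a small Python sketch of the quadratic error for the blue and green lines. The value c = 2.0 and the helper name quadratic_error are assumptions made up for this illustration; only the pattern of the errors (three points off by c, or six points off by c/2) comes from the example above.

```python
def quadratic_error(residuals):
    """Sum of squared vertical distances between data points and a line."""
    return sum(r * r for r in residuals)


c = 2.0  # hypothetical gap; any positive value gives the same ratio

# Blue line: exact for three points, off by c for the other three.
blue_residuals = [0, 0, 0, c, c, c]
# Green line: off by c/2 for all six points (three above, three below).
green_residuals = [c / 2, c / 2, c / 2, -c / 2, -c / 2, -c / 2]

print(quadratic_error(blue_residuals))   # 12.0, which is 3 * c**2
print(quadratic_error(green_residuals))  # 6.0, which is 1.5 * c**2
```

The green line is wrong everywhere, yet its total is half the blue one, because squaring penalizes the three large errors more than the six small ones.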
Let's look at another data set relating x and y, and now I'm going to ask you a different question: is the parameter b negative or positive? Assume that this is (0, 0) and, say, this is (10, 10). I ask you the same for the parameter a: is it negative or positive? In either case, choose exactly one answer.
And once again, a must be positive. If we draw the line, a is the value of y at the intersection with the y axis, where x=0. Remember, all linear functions are of the form bx+a, so if x equals 0, this term falls out and we have y=a, so this is the value of y. It must be positive, since it's above 0, and b happens to be negative. For any graph that goes down from left to right, b is negative. That's easily seen: as you make x larger, you make y smaller. That's a negative correlation, as we'll call it in a future unit, but for now, the graph goes down. That's good. Now we understand a lot about this formula over here.
Y = bx + a. The holy grail in linear regression, and in much of statistics, is how to use data to determine the value of b and the value of a. If you can do this with data, then we've solved the problem of fitting the best line, and that's once again called linear regression. I won't give you the derivation, but I'll give you the formula. Let's start with b. Assume your data comes in pairs of x and y, as indicated here. The formula for b might look really complex at first, but I promise you it isn't. Just as in the calculation of the variance, we take the difference between x and the mean of x, but rather than squaring this as in the variance case, we now take the product with the same term for the y's. Importantly, the notation x-bar is the mean of the Xi and y-bar is the mean of the Yi. Previously, we would have called x-bar µ, but now that we have two variables, we're going to use the bar notation.
Let's go back and look at this formula here. When we computed the variance, we would have taken (Xi - X-bar)², but here we're taking the product of Xi - X-bar with the same thing in the y direction, Yi - Y-bar. So it's similar to computing a variance. The last thing we do is normalize this thing. Before, we often normalized with n, but now we're going to normalize with something else. We normalize with a term that very much reminds us of the variance.
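The transcript does not reproduce the formula itself, so here it is written out as described in words above: a sum of products in the numerator, normalized by a variance-like sum in the denominator. The index i and the count n are notation added for this write-up.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```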
Here's our formula again. Let's assume I have the following data. There's an x variable and a y variable, and we have four data points: 6 goes to 7, 2 to 3, 1 to 2, and -1 to 0. Very quickly, we can look at the data in a diagram, in this case a scatter plot. So here's the x axis and here's the y axis. If I plot the data, 6 goes to 7, just about here, 2 to 3, 1 to 2, and -1 to 0, then not surprisingly this data is, in fact, linear. Whatever nonlinearity you see is because of my poor drawing skills. And if you look carefully, you've already seen that y is always 1 larger than x. So before we apply the formula, let's just quickly guess what b and a would be if y is always 1 larger than x. What are the coefficients b and a?
And it's 1 and 1. The way I get these numbers, 1 and 1, is really easy. I just write down that y is always larger than x by 1, so y=x+1. When you look at this, that just means a=1 and b=1, because you can think of this as 1x instead of just x over here, and this gives us b equals 1.
But I'm now going to do this with the formula up here. In particular, we have four things we add in the denominator and also four things we add in the numerator. It's four because there are four data points over here. Before we fill in this data, I'm going to ask you a different question, one that is easy for you to answer: what are the means of x and of y in this data set? You will need them when you fill in these blanks over here. So please give me just the two numbers here in green.
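Note: there is a bar missing from the $x$ in the denominator of the formula shown; each term should read $(x_i - \bar{x})^2$.
So the denominator is equal to $\sum_i (x_i - \bar{x})^2$, where $\bar{x}$ is the mean of $x$ and the $x_i$ are the values of $x$.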
The x's add up to 8, and 8/4 is 2; hence, the mean for x is 2. For y, you find it to be 3: the y's add up to 12, and 12/4 is 3. So these are the means.
Let's now fill in the very first value on the left over here, which is for the first data point: this expression shown over here. Please go ahead and fill in the first value.
And my solution is 16, and the reason is that 6-2 is 4, so the first expression results in 4. The second expression, the y expression, also results in 4, since 7-3 is 4. So 4 times 4 makes 16.
Let's now go and add the three next values over here.
To me, the answer is 0 for the first one: 2-2=0, so the first expression is 0, and the second one, 3-3, is going to be 0 as well. The next one is 1: 1-2=-1 for the first expression, and for the second expression, 2-3=-1, and -1 times -1 gives us 1. The last one is 9: -1-2 is -3 for the first expression, and 0-3 is also -3, so we get -3 times -3, which is 9.
Now I want you to add the values down here, and those should look really familiar, because you've done so many versions of this when computing variances from data. Go ahead and fill them in.
And surprisingly, they happen to be exactly the same as in the top. Why? For this specific data set, you get, for example, that 6-2, this expression over here, is the same as 7-3. If we look at this carefully and work these things out, we find that 16, 0, 1, and 9 are the answers.
What do we get for b? Please enter the answer here.
And the answer is 1, as we expected: the sum in the numerator is 26, and so is the sum in the denominator. Now, that happens to be a trivial example for two reasons. One is that b=1 is the simplest case of a linear relationship, and secondly, there is no noise. The line with b=1 fits the data exactly, but we did it to exercise the use of the formula.
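The fill-in-the-blanks steps for b can be replayed in Python as a sanity check. This is a minimal sketch using the four data points from this example; the variable names are chosen for this illustration only.

```python
# The four data points from the example: 6->7, 2->3, 1->2, -1->0.
xs = [6, 2, 1, -1]
ys = [7, 3, 2, 0]

x_bar = sum(xs) / len(xs)  # mean of x: 8 / 4 = 2.0
y_bar = sum(ys) / len(ys)  # mean of y: 12 / 4 = 3.0

# One numerator term per data point: (x_i - x_bar) * (y_i - y_bar).
numerator_terms = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
# One denominator term per data point: (x_i - x_bar) squared.
denominator_terms = [(x - x_bar) ** 2 for x in xs]

b = sum(numerator_terms) / sum(denominator_terms)

print(numerator_terms)    # [16.0, 0.0, 1.0, 9.0]
print(denominator_terms)  # [16.0, 0.0, 1.0, 9.0]
print(b)                  # 1.0
```

The two lists only coincide because y is always exactly x+1 here; with noisy data the numerator and denominator terms would differ.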
The final thing I want to teach you is how to calculate a. There is a simple formula as well, and the insight here is that you already know b, and we know that bx+a results in y for every linear function. Now, what happens to be true is that if we plug in the average x and the average y, then this equation is still correct: y-bar = b x-bar + a. Now we can rearrange the terms: a = y-bar - b x-bar. So that's the formula for a. Now we have the formula for b and the formula for a. That's really cool. Let's apply this here and see what your value for a would be. Just give me the answer based on plugging the means into the formula over here, using the appropriate b.
And the answer is 1. The mean of x is 2, the mean of y is 3, and b=1, so 3 - 1 times 2 makes 1. So using these formulas, we've recovered what we kind of already knew about the data, namely that y is x+1, that is, a=1 and b=1, because there's exactly 1 x over here in this equation. So let's make this more complicated.
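Putting the two formulas together, a complete fit might look like the following Python sketch. The function names fit_line and predict are made up for this illustration, and the check for a zero denominator is an added safeguard that the lecture does not discuss.

```python
def fit_line(xs, ys):
    """Return (a, b) of the regression line y = b*x + a for paired data."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        # All x values are identical, so no slope can be determined.
        raise ValueError("the x values must not all be the same")
    b = numerator / denominator
    a = y_bar - b * x_bar  # the line passes through (x_bar, y_bar)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted line at a new x value."""
    return b * x + a


a, b = fit_line([6, 2, 1, -1], [7, 3, 2, 0])
print(a, b)               # 1.0 1.0
print(predict(a, b, 10))  # 11.0
```

The last line is the prediction use case mentioned at the start of the unit: x=10 is not in the data, yet the fitted line gives a value for it.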
And you can now easily give me the value for a in this box.
And the value here is 15.