**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**

**Please check the wiki guide for some tips on wiki editing.**

Contents

- 1 30. Correlation
- 1.1 01 Introducing Correlation
- 1.2 02 Introducing Correlation Solution
- 1.3 03 Correlation From Regression
- 1.4 04 Correlation From Regression Solution
- 1.5 05 Correlation Formula
- 1.6 06 Compute Correlation 1
- 1.7 07 Compute Correlation 1 Solution
- 1.8 08 Compute Correlation 2
- 1.9 09 Compute Correlation 2 Solution
- 1.10 10 Compute Correlation 3
- 1.11 11 Compute Correlation 3 Solution
- 1.12 12 Guess R
- 1.13 13 Guess R Solution
- 1.14 14 Compute Actual 1
- 1.15 15 Compute Actual 1 Solution
- 1.16 16 Compute Actual 2
- 1.17 17 Compute Actual 2 Solution
- 1.18 18 Compute Actual 3
- 1.19 19 Compute Actual 3 Solution
- 1.20 20 Compute Actual 4
- 1.21 21 Compute Actual 4 Solution
- 1.22 22 Another Example 1
- 1.23 23 Another Example 1 Solution
- 1.24 24 Another Example 2
- 1.25 25 Another Example 2 Solution
- 1.26 26 Another Example 3
- 1.27 27 Another Example 3 Solution
- 1.28 28 Another Example 4
- 1.29 29 Another Example 4 Solution
- 1.30 30 R Intuition
- 1.31 31 R Intuition Solution
- 1.32 32 Final Example 1
- 1.33 33 Final Example 1 Solution
- 1.34 34 Final Example 2
- 1.35 35 Final Example 2 Solution
- 1.36 36 Final Example 3
- 1.37 37 Final Example 3 Solution
- 1.38 38 Final Example 4
- 1.39 39 Final Example 4 Solution
- 1.40 40 Summary

In this unit, I'll teach you about a term called correlation. It's an important statistical term, and by the end of this unit you'll be able to use it yourself. Here's the fundamental problem. Sometimes data lines up very nicely, like these points over here. Other times, two variables seem to be utterly unrelated. Correlation is a measure that ranges from -1 all the way to 1 and tells us how well the data is described by a line. In both cases you can fit a line, but in one case it would be a really good description of the data, whereas in the other it wouldn't. So the correlation coefficient, which we call r, is 1 if the data is perfectly aligned on a line. It's 0 if there seems to be no relation between the two axes of the data. And it can also be -1, in the case where the data is still perfectly aligned but there's a negative relationship between one variable and the other. Let me see if you got this. Here are three data sets. In one, we have a strongly positive correlation; in another, it's about zero; and in the third, it's a negative correlation. Which of these cases best matches each condition? Use each condition exactly once.

And the answer goes like this. Here we have a clear line. In fact, r will probably be 1, and it's positive: both variables grow at the same time. With this U shape, there might be a dependence, but it's not linear. If we fit the best line, it's going to be flat, and with a flat line there is really no linear dependence between the x and the y variable. So knowing x does not tell you about y and vice versa, which means r tends to be 0. And in this last case, there is a negative line. The best line fit will go something like this, and r might be as small in magnitude as -0.2, but it is negative, the reason being that the line over here slopes downward.

Let me ask a different quiz. Suppose you run linear regression and find that b = 4 and a = -3. To remind you, this describes the following linear relationship between x and y: y = bx + a, with b as the slope. My question is about the correlation coefficient r. Will r be positive, negative, or zero, or can't we tell? Check exactly one box.

I should say this is not an easy question. And the answer is positive. Whatever the data was, this is roughly what the line looks like, and that means there is a tendency in the data that as you increase x, y increases too, so the correlation will be positive. We can't tell what value it is. If the data fits exactly onto the line, then it would be 1. It could also be that the data has enormous deviation from this line, and this is still the best-fitting line, in which case the correlation coefficient will still be larger than zero but might be much closer to zero, depending on the amount of deviation of those points from the line.

To summarize, the correlation coefficient, which you're just about to learn to compute, is a value between -1 and 1 that tells us how related, or correlated, two variables are. Both 1 and -1 stand for perfectly linear data. In the case of +1, we know that x and y increase simultaneously along the line, whereas for -1 we have the inverse effect.

Let's compute r. My favorite way to compute it is very similar to the way we computed b in linear regression. It takes the sum over all data points of the product of (xi - x-bar) and (yi - y-bar). This could be any value; it isn't confined to the range between -1 and +1, so we have to normalize. We normalize by the square root of the sum over all i of (xi - x-bar)² times the same expression for y.

Now this looks a little bit wild, and it is probably the worst formula you've encountered in this class in terms of complexity, but once you've dived in, you'll realize it is closely related to a lot of stuff you've seen before, such as variances. The sum of (xi - x-bar)² is the quintessential term that occurs in the variance calculation of x. All that's missing is the normalizer. The same holds for the sum of (yi - y-bar)², which is the variance of y without the normalizer. We take the product of the variance of x and the variance of y, modulo the normalizers, and we get something quadratic even in variance space, which is why we take the square root.

The numerator is kind of a mixed variance. It is just like the variance calculation, but it mixes x's and y's, whereas the terms in the denominator use x² and y². Once you've normalized it, it is often called the covariance, because it is the variance calculation of two co-occurring variables. Notice that a normalizer is missing here as well. In fact, the missing normalizers on the top and bottom of the fraction bar cancel each other out; hence, I just omitted them.

So what this really tells you is a ratio: how much these two things co-evolve, that is, how much the errors correspond, normalized by the magnitude of the errors individually. When the ratio becomes 1, we have a perfect correlation. When the ratio becomes 0, the numerator is 0, which means the errors cancel each other out: they are very different for x and for y under any linear model. So this complicated formula is what's called the correlation coefficient r. Let's try this out.
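Written out in symbols, the formula described in words above might look as follows. The notation (n data points indexed by i, bars for the means) is an assumption reconstructed from the spoken description rather than copied from the lecture slide.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The second form puts the normalizer 1/n back in: the top becomes the covariance and the two factors under the square root become the variances of x and y. The 1/n on top and the square root of (1/n)² on the bottom cancel, which is why the first form can omit them.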

Let me give you a data set: x = 3, 4, 5, and for those x's we get y = 7, 8, 9. The first data item is (3, 7), the second (4, 8), the third (5, 9). It's easy to see that the mean x-bar is 4 and the mean y-bar is 8, which gives us, for x - x-bar and y - y-bar, the new numbers -1, 0, 1 and -1, 0, 1 again. So these are the three mean-normalized data points. Let's now compute these three values over here for this example. Give me the first one.

The answer is 2. If we multiply, for each data point, the first expression by the second expression, we get -1 × -1, which is 1, then 0, then 1 again, and that adds up to 2.

Please calculate the expression over here and put it in this box.

And the answer is 2 again: (-1)² + 0² + 1² = 2. And the third one will give you the exact same 2 as before. I'll just do this for you.

And we now work it out. What do you think is the answer?

And yes, the answer is 1. 2 × 2 is 4. The square root of this is 2 again. 2/2 gives us 1.
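The three quiz steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `correlation` and its variable names are chosen here for readability.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, built from the three sums in the formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]   # x - x-bar for each data point
    dy = [y - y_bar for y in ys]   # y - y-bar for each data point
    numerator = sum(a * b for a, b in zip(dx, dy))  # mixed term
    sum_sq_x = sum(a * a for a in dx)               # x part of the normalizer
    sum_sq_y = sum(b * b for b in dy)               # y part of the normalizer
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

# The data set from the example: the sums are 2, 2, and 2.
r = correlation([3, 4, 5], [7, 8, 9])
print(r)  # 1.0
```

Note that the sketch divides by zero if all x values or all y values are identical, since one of the normalizer sums is then 0; r is undefined in that case.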

Let's now work with a different data set. We write 2, 5, 8 for y, which gives us a mean of 5 for y. Before doing any calculations, let's take a guess at what r might be. Is r going to be 1, 3, 2, or 0? Check one of these boxes over here.

One is actually correct, and by virtue of what I told you before, you can figure out that r has to be 1. The reason is that we know r is between -1 and +1, so it can't be 3 or 2. And 0 is kind of the pessimistic case where there's no relationship whatsoever between the data, but clearly this data fits on a line. In fact, when it fits on a line, no matter what the steepness of the line is, as long as it isn't flat and there's a positive relationship, no matter how small or how large, r is going to be 1. So this is the correct answer.

Let's see if you can find this. I filled out this table for you. I know you can use this table to calculate the numerator of the fraction over here.

And the answer is 6. -1 × -3 is 3. Add to it 0. Add to it another 3. It ends up being 6.

What's the value over here?

Clearly, it's 2. (-1)² is 1, then 0, then another 1. Add those together and you have 2.

And how about the value over here?

It's 18. The reason being (-3)² is 9. Add to it 0. Add to it another 9. We get 18.

So what do we get down here?

Well, 18 × 2 makes 36. The square root of this is 6. 6/6 is 1. We've just proven to ourselves that 1 is the correlation coefficient even for this data set over here.
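The claim that the steepness of the line does not matter can be checked numerically. The sketch below is illustrative only: the slopes 0.5 and 100 are made-up values, while slope 3 with intercept -7 reproduces the y values 2, 5, 8 used above.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r from the sums of products of deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

xs = [3, 4, 5]
results = {}
for slope in (0.5, 3, 100):           # 0.5 and 100 are hypothetical slopes
    ys = [slope * x - 7 for x in xs]  # points lying exactly on a rising line
    results[slope] = correlation(xs, ys)
    print(slope, ys, results[slope])
```

Every positive slope gives r = 1 (up to floating-point rounding), because a steeper line scales the numerator and the normalizer by the same factor.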

Now, how about we switch the order of the y values from 2, 5, 8 to 8, 5, 2? The mean stays the same, but several of these values over here change. And before I compute it, let me again test your intuition. Is r larger than 0, equal to 0, or smaller than 0? Pick one.

And the answer is smaller than 0. There's a negative correlation. When x goes up, y goes down. If you draw out the data, you get something like this for x and y. The data fits a line perfectly, which makes me believe r = -1, the perfect negative correlation.

Let me just fill in the table over here: 3, 0, -3. In our equation, this term doesn't change, because it only depends on x and I haven't changed x at all. But let's compute the numerator over here.

The answer is -6. -1 × 3 is -3. Add to it 0. Add to it another -3. You get -6.

What's the value over here?

And just like before, it's 18. We reordered the numbers, but the sum is still 9 + 9.

So what do we get over here?

It's -1. Again, 2 × 18 gives us 36. The square root is 6. -6/6 gives us -1. This is the correct correlation.
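The table-filling steps of this example can be traced in code. This is a sketch under one assumption: the y values are taken as 8, 5, 2, which is what the deviation table 3, 0, -3 around a mean of 5 implies.

```python
import math

xs = [3, 4, 5]
ys = [8, 5, 2]  # assumed from the deviation table 3, 0, -3 and mean 5

x_bar = sum(xs) / len(xs)  # 4.0
y_bar = sum(ys) / len(ys)  # 5.0

# One row per data point: (x - x_bar, y - y_bar)
rows = [(x - x_bar, y - y_bar) for x, y in zip(xs, ys)]
for dx, dy in rows:
    print(dx, dy, dx * dy)

numerator = sum(dx * dy for dx, dy in rows)  # -3 + 0 + -3
sum_sq_x = sum(dx * dx for dx, _ in rows)    # 1 + 0 + 1
sum_sq_y = sum(dy * dy for _, dy in rows)    # 9 + 0 + 9
r = numerator / math.sqrt(sum_sq_x * sum_sq_y)
print(numerator, sum_sq_x, sum_sq_y, r)
```

Only the numerator changes sign compared with the increasing data set; the two sums of squares are unaffected by reordering y.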

Now, let's do something tricky. Let's use 8, 5, 8 for y, which gives us a mean of 7 for y and the following table down here: 1, -2, 1. Let me test your intuition. Is r larger than 0, equal to 0, or smaller than 0? That is, positive correlation, no correlation, or negative correlation? Pick one.

And you could do the intuition thing: you can look at this and arrive at the conclusion that there's no correlation, and this is correct. It's a little bit tricky, but you can see that as x goes up from 3 to 4, y shrinks. Then as x goes from 4 to 5, the opposite happens to y, and it increases by the same amount. That means our data set looks as follows: up, down, and up. As we already saw, the best fit is the horizontal line, and the horizontal line just means x and y are completely independent under this best linear fit. Knowing x says nothing about y and knowing y says nothing about x, and this kind of independence leads to a coefficient of 0. Let's check it. What's the field over here? It's 0, because -1 + 0 + 1 = 0. That's interesting, even though this field over here ends up being 6, which is 1² + (-2)² + 1². Zero divided by anything nonzero gives us 0 at the end. So let's look at one final case.
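The link between the horizontal best-fit line and r = 0 can be made concrete. The sketch below assumes the standard least-squares slope, b = sum of (xi - x-bar)(yi - y-bar) divided by the sum of (xi - x-bar)², which shares its numerator with r; whether this matches the exact form used earlier in the course is an assumption.

```python
import math

xs = [3, 4, 5]
ys = [8, 5, 8]

x_bar = sum(xs) / len(xs)  # 4.0
y_bar = sum(ys) / len(ys)  # 7.0

numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # -1 + 0 + 1
sum_sq_x = sum((x - x_bar) ** 2 for x in xs)                        # 1 + 0 + 1
sum_sq_y = sum((y - y_bar) ** 2 for y in ys)                        # 1 + 4 + 1

b = numerator / sum_sq_x                        # assumed least-squares slope
r = numerator / math.sqrt(sum_sq_x * sum_sq_y)  # correlation coefficient
print(b, r)
```

Both b and r come out as 0 because they share the same numerator: a flat best-fit line and zero correlation are the same statement about these data.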

This is our data. Now y goes from 8 down to 3 and back up to 7. It looks something like this: 8, 3, back to 7. Clearly, it doesn't look very correlated. Check your intuition. Is r larger than 0, equal to 0, or smaller than 0?

And the answer ends up being smaller than 0. You can see from the data points that if this blue guy over here were up here, then the horizontal line would be the best fit. But the blue point is a little bit lower, so a slightly downward-tilted line, like this one over here, ends up being the better fit.

So let's look at this and compute the mean y value, which is 6. I'll fill in this table for you: 2, -3, 1, which is this row over here minus 6. Now give me the first value over here.

It's -1. -2+0+1=-1.

Give me this value over here.

Now, I think it's 14. 2² is 4, plus 9 is 13, plus 1 is 14.

Let's go to the final value over here. What is it?

And now we need a calculator. You have the square root of 28, and you divide -1 by that square root and get approximately -0.189 when I work this out. So there's a negative correlation, but it's weak. The data really isn't well described by a linear function. If the data instead were to lie exactly on this line, then r would be -1, a perfect negative correlation.
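Instead of a calculator, the last step can be done in code, alongside the earlier y sequences for comparison. This is an illustrative sketch; the helper `correlation` and the dictionary labels are names made up here, and the second sequence is taken as 8, 5, 2, as implied by its deviation table.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r from the sums of products of deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

xs = [3, 4, 5]
cases = {
    "rising line": [2, 5, 8],
    "falling line": [8, 5, 2],
    "down then up": [8, 5, 8],
    "final example": [8, 3, 7],
}
results = {label: round(correlation(xs, ys), 3) for label, ys in cases.items()}
for label, value in results.items():
    print(label, value)
```

For the final example the numerator is -1 and the normalizer is the square root of 2 × 14 = 28, so r rounds to -0.189: negative, but far from -1.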

In summary, you learned about the correlation coefficient. It's larger than 0 if there's a positive relationship between x and y. It's smaller than 0 if the relationship is negative. It's equal to 0 if there is no linear relationship. The magnitude of r goes to 1 as the relationship becomes increasingly linear, without any noise or any deviation from the line. This is a powerful measure. For any data set with multiple variables, you can now tell how much the variables relate to each other. If someone, for example, shows you the salary of a person and the age of a person, you can say they're really correlated, or you could say they're not correlated at all. Whatever you say, with what I believe to be a very simple formula, the formula right over here, you can now compute for any data set how much x and y relate. That's called the correlation, and it's a really important lesson in statistics. I use it all the time to inspect data and make a statement about how much two variables relate to each other in a linear way. Thank you.