**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**

Contents

- 1 Problem Set 6: Regression and Correlation
- 1.1 01 Regression
- 1.2 02 Regression Solution
- 1.3 03 Influential Observation
- 1.4 04 Influential Observation Solution
- 1.5 05 Correlation And Regression
- 1.6 06 Correlation And Regression Solution
- 1.7 07 Double Y
- 1.8 08 Double Y Solution
- 1.9 09 Double Both
- 1.10 10 Double Both Solution
- 1.11 11 Slope To Correlation
- 1.12 12 Slope To Correlation Solution
- 1.13 13 Standard Score Regression
- 1.14 14 Standard Score Regression Solution
- 1.15 15 Why Regression
- 1.16 16 Why Regression Solution

Let's consider this set of data. We have our x's and our y's, and we fit y = bx + a. Please tell me what b is. Enter your answer here. Before you start computing, just take a look at the data.

Of course, since y is always 0, b is going to be 0.

Let's say we get 1 new data point. We have 1000 for x and 1000 for y. Now tell me what b should be with this new data. Please enter your answer here.

To compute this, recall that b equals the sum of the products of the differences of the x's from their mean and the y's from their mean, divided by the sum of the squared differences between the x's and their mean. So, the mean of the x's is 104, and the mean of the y's is 100. Now let's write out the deviations for each x. Now for the differences of the y's. If we add up the products of these, we get a fairly large number: 896,000. The denominator is equal to 892,100. Taking the ratio of these two things gives us our answer of 1.004. This is interesting, because it suggests that when we have data like this but add one single point all the way over here, our best fit line goes from being this to being this. Now, note this isn't quite to scale. This point would really be all the way over here.
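The effect of the single new point can be reproduced with a short sketch. The original data table is not shown in these notes, so the nine x values below are hypothetical: they were picked only because they sum to 40 and their squares sum to 260, which matches the means (104 and 100) and the sums (896,000 and 892,100) quoted above. The helper name `slope` is also just an illustration.

```python
# Hypothetical x values: the original data set is not shown in these notes.
# They were chosen so that the means and sums quoted in the text come out right.
xs = [0, 1, 2, 3, 4, 6, 7, 8, 9]
ys = [0] * len(xs)  # y is always 0 before the new point is added


def slope(xs, ys):
    """Least-squares slope b for the line y = b*x + a."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


b_before = slope(xs, ys)                    # flat line through y = 0
b_after = slope(xs + [1000], ys + [1000])   # add the point (1000, 1000)

print(b_before)
print(round(b_after, 3))  # 896000 / 892100, about 1.004
```

With any other set of nine x values the exact sums would differ, but the slope would still jump from 0 to roughly 1 because the new point is so far from the rest.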

Now let's look at the relationship between correlation and regression as we make changes to our data. Let's consider the following simple data set. We have x of 1, 2, and 3 and y of 4, 7, and 13. Consider the regression coefficient for the equation y = bx + a and the correlation coefficient. Please compute b and r and enter your answers here.

To answer this we'll apply these two formulas. First we'll compute x-bar and y-bar. X-bar is 2. Y-bar is 8. Now we'll compute our differences: -1, 0, 1 for the x's, and -4, -1, and 5 for the y's. The numerator of each of these is 4 + 0 + 5 = 9. The denominator of b is 1 + 0 + 1 = 2, so b is equal to 4.5. Now, to sum up the squared differences in the y's, we have 16 + 1 + 25 = 42. The product of these two sums is 84, so we have 9 / √84, which is 0.982.
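The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the two formulas as described in the transcript; the variable names `sxy`, `sxx`, and `syy` are my own shorthand for the three sums.

```python
import math

xs = [1, 2, 3]
ys = [4, 7, 13]

x_bar = sum(xs) / len(xs)  # 2.0
y_bar = sum(ys) / len(ys)  # 8.0

# Sum of products of deviations, and sums of squared deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 9.0
sxx = sum((x - x_bar) ** 2 for x in xs)                       # 2.0
syy = sum((y - y_bar) ** 2 for y in ys)                       # 42.0

b = sxy / sxx                   # regression coefficient
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

print(b, round(r, 3))  # 4.5 0.982
```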

Now, I wonder what happens to b and r if we double our y's. Fill in your answer here.

Again we'll compute our means. X-bar equals 2. Y-bar is 16. Now we'll take the differences. They're the same for the x's as they were before, and for the y's they are -8, -2, and 10. The numerator of each of these is 8 + 0 + 10 = 18, and the denominator of b is still 2, so b is 9. Interesting: exactly double what it was before. This always holds true. If we double our y's, we always double b. Now let's see what happens to the correlation coefficient. The squared differences in the y's sum to 64 + 4 + 100 = 168, and multiplying by the 2 from the x's gives 336. The correlation coefficient is 18 / √336, which is 0.982, exactly the same as it was before.

Now what happens if we double our x's as well? Please enter your answer in the boxes over here.

I won't bore you with the arithmetic again. It's the same as it was the last two times, and we wind up with the same answers we had the first time. B is equal to 4.5, and r is equal to 0.982. What I hope you've noticed is that r doesn't change when we scale x or y. B doesn't change as long as we scale x and y by the same amount.
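For anyone who does want to see the arithmetic, here is a minimal sketch that recomputes b and r for the original data, for doubled y's, and for doubled x's and y's. The helper name `slope_and_r` is an assumption made for this example.

```python
import math


def slope_and_r(xs, ys):
    """Return (b, r) for the least-squares line y = b*x + a."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3]
ys = [4, 7, 13]

base = slope_and_r(xs, ys)
double_y = slope_and_r(xs, [2 * y for y in ys])
double_both = slope_and_r([2 * x for x in xs], [2 * y for y in ys])

for label, (b, r) in [("original", base),
                      ("double y", double_y),
                      ("double both", double_both)]:
    print(label, b, round(r, 3))
```

The slope doubles only in the middle case, while r rounds to 0.982 in all three.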

I'd like to challenge you to find an algebraic relationship between b and r. Can you tell me what b / r is? Is it 1 / Σ(y - y-bar)²? Is it Σ(x - x-bar)² / Σ(y - y-bar)²? Is it the square root of Σ(x - x-bar)² / Σ(y - y-bar)²? Or perhaps is it the square root of Σ(y - y-bar)² / Σ(x - x-bar)²? Choose the right answer.

The answer is this. It is equal to the square root of the sum of the squared deviations in the y's over the sum of the squared deviations in the x's. Note that this is exactly the same as the standard deviation of y divided by the standard deviation of x, because the number of observations divides out. To see that this is, in fact, true, when we divide b by r we can see the numerators cancel, so we have b/r equal to the square root of the sum of the squared differences in the y's times the sum of the squared differences in the x's, all over the sum of the squared differences in the x's. The key thing to recognize is we can break the square root into two pieces. The denominator can be written as the square of a square root. Now we can just cancel this and we have our answer.
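A quick numerical check of this identity, sketched on the small data set from the earlier questions (x of 1, 2, 3 and y of 4, 7, 13). It uses the population standard deviation from Python's `statistics` module; the sample standard deviation would give the same ratio, since the divisor cancels.

```python
import math
import statistics

xs = [1, 2, 3]
ys = [4, 7, 13]

x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx
r = sxy / math.sqrt(sxx * syy)

ratio_from_sums = math.sqrt(syy / sxx)
ratio_from_sds = statistics.pstdev(ys) / statistics.pstdev(xs)

# All three values agree (about 4.5826, the square root of 21).
print(round(b / r, 4), round(ratio_from_sums, 4), round(ratio_from_sds, 4))
```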

Let's now consider linear regressions of standard scores. Recall the standard score for x, zₓ, is equal to xᵢ - x-bar divided by the standard deviation of x. Let's assume we standardize both x and y, so we now have a set of zₓ's computed this way and zᵧ's for the equation zᵧ = b zₓ + a. I'd like you to tell me about the values of a and b. Specifically, tell me what is their minimum value on any given data set and what is the maximum. Please fill in your answers here.

For a the answer is 0 and 0. A can only ever be 0. To see this, recall that a is equal to y-bar - b x-bar, but for standard scores, zₓ always has a mean of 0 and zᵧ has a mean of 0, because we subtracted out their means. A must always be 0. B must always be between -1 and 1. To see this, recall that, as you just discovered, b is equal to r times the standard deviation of y over the standard deviation of x. Since these are each 1, their ratio is 1, so b must have the same range as r, from -1 to 1.
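As a sanity check, the following sketch standardizes the small data set from the earlier questions and fits a line to the standard scores. The helper names `standardize` and `fit_line` are hypothetical, and the population standard deviation is assumed for the z-scores.

```python
import math
import statistics


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def fit_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = b*x + a."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3]
ys = [4, 7, 13]

a_raw, b_raw, r_raw = fit_line(xs, ys)
a_z, b_z, r_z = fit_line(standardize(xs), standardize(ys))

# For the standard scores the intercept is 0 (up to rounding) and b equals r.
print(round(a_z, 6), round(b_z, 3), round(r_raw, 3))
```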

*Note:*

Here's a detailed solution:

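Let's review the equations we have learned for linear regression first:

That's the linear regression we want to accomplish:

$$y = bx + a$$

That's how we calculate the parameters $a$ and $b$ from the samples of $x$ and $y$:

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are respectively the means of $x$ and $y$. That's the correlation:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

We will also need the relationship between $b$ and $r$ that we have derived on PS6-6:

$$b = r\,\frac{\sigma_y}{\sigma_x}$$

These are the equations for the z-scores of $x$ and $y$:

$$z_x = \frac{x - \bar{x}}{\sigma_x}, \qquad z_y = \frac{y - \bar{y}}{\sigma_y}$$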

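We want to obtain the linear regression between $z_x$ and $z_y$, in the same fashion we did for $x$ and $y$ in the lectures. That's exactly the same method, but instead of doing linear regression for $x$ and $y$, we are going to do this for their z-scores. The z-scores can be treated as new random variables, which happen to depend on $x$ and $y$.

Let's first calculate the means. The mean of a sum is equal to the sum of the means, so:

$$\bar{z}_x = \overline{\left(\frac{x}{\sigma_x} - \frac{\bar{x}}{\sigma_x}\right)} = \overline{\left(\frac{x}{\sigma_x}\right)} - \overline{\left(\frac{\bar{x}}{\sigma_x}\right)}$$

$\bar{x}/\sigma_x$ is a constant, therefore its mean is equal to itself.

The mean of a constant times a variable is equal to the constant times the mean of the variable, therefore:

$$\overline{\left(\frac{x}{\sigma_x}\right)} = \frac{\bar{x}}{\sigma_x}$$

Resulting in:

$$\bar{z}_x = \frac{\bar{x}}{\sigma_x} - \frac{\bar{x}}{\sigma_x} = 0$$

The process for $z_y$ is analogous to $z_x$ so, if you repeat it for $z_y$, you're going to find out that:

$$\bar{z}_y = 0$$

We know that:

$$a = \bar{y} - b\,\bar{x}$$

Therefore, for the regression on the z-scores:

$$a = \bar{z}_y - b\,\bar{z}_x$$

But since both means are zero:

$$a = 0 - b \cdot 0$$

Therefore:

$$a = 0$$

The range of $a$ is $[0, 0]$.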

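Let's now calculate $b$. We know that (from PS6-6), applied to the z-scores:

$$b = r\,\frac{\sigma_{z_y}}{\sigma_{z_x}}$$

Let's calculate the standard deviations of $z_x$ and $z_y$. Writing $\mathrm{sd}(\cdot)$ for the standard deviation:

$$\sigma_{z_x} = \mathrm{sd}\!\left(\frac{x}{\sigma_x} - \frac{\bar{x}}{\sigma_x}\right)$$

The first term is $x$ divided by a constant and the second term is a constant. If you add a constant to a variable or subtract a constant from a variable, this does not change its standard deviation, therefore:

$$\sigma_{z_x} = \mathrm{sd}\!\left(\frac{x}{\sigma_x}\right)$$

The standard deviation of a (positive) constant times a variable is the constant times the standard deviation of the variable, therefore:

$$\mathrm{sd}\!\left(\frac{x}{\sigma_x}\right) = \frac{1}{\sigma_x}\,\sigma_x$$

Resulting in:

$$\sigma_{z_x} = 1$$

In an analogous way for $z_y$:

$$\sigma_{z_y} = 1$$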

**There's an important insight here: The mean of the z score of a random variable is always zero and the standard deviation of its z score is always one.**

Since $\sigma_{z_x} = \sigma_{z_y} = 1$:

$$b = r \cdot \frac{1}{1} = r$$

$b$ is the same as the correlation. Since the correlation ranges over $[-1, 1]$, that's true for $b$ as well, therefore:

The range of $b$ is $[-1, 1]$.
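The two rules this solution relies on (how the mean and the standard deviation react to multiplying by a constant and adding a constant) can be checked numerically. The sample below is hypothetical and was chosen only because its mean (5) and population standard deviation (2) are round numbers; the constants `c` and `k` are likewise arbitrary.

```python
import math
import statistics

# Hypothetical sample, used only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = statistics.mean(data)   # 5.0
sd = statistics.pstdev(data)   # 2.0

# Rule check with arbitrary constants: mean(c*x + k) = c*mean(x) + k,
# sd(c*x + k) = |c| * sd(x).
c, k = 3.0, -10.0
shifted = [c * v + k for v in data]
shifted_mean = statistics.mean(shifted)
shifted_sd = statistics.pstdev(shifted)

# The z-score is the special case c = 1/sd and k = -mean/sd.
z = [(v - mean) / sd for v in data]
z_mean = statistics.mean(z)
z_sd = statistics.pstdev(z)

print(shifted_mean, shifted_sd)  # 5.0 6.0
print(z_mean, z_sd)              # 0.0 1.0
```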

Now I'd like to ask you a tricky question about regression that will illustrate why regression got its name. Why is this thing that gets us the best fit of a line called regression? What does that have to do with regression, that is, going backwards? I'd like you to tell me for any two variables x and y, if they're not perfectly related (that is, if the absolute value of their correlation is not 1), how extreme is our y prediction compared to x? By extreme, to be a little more precise, I mean this. We have x. We have some observation in the x's, say right here. We have a total probability, denoted by this little shaded region, that a given xᵢ is either at that value or farther away from the mean, on this side or in the other direction. We can map that to a y using our regression equation, and we'll get a value in the y distribution, and it will have, again, some probability of being that value or more extreme. The smallness of that probability is what we mean by extremity. A 1% chance is more extreme than a 40% chance. One in a million is more extreme than 1 in 1,000. What I'd like you to tell me is: will y be more extreme? Will y be as extreme? Will y be less extreme? Or does it depend on the data? Please check the right answer.

Y will always be less extreme. The key to understanding this simply is this equation. This is what tells us that the standard scores have a regression coefficient whose absolute value is ≤ 1. We said it's not 1, so it must be less than that. That means the standard score shrinks. If the standard score shrinks, that means if it's negative it moves this way. If it's positive it moves this way. That means there is more area under here, so this has to be a little more area than this. Therefore, y will always be less extreme. This is the concept of regression. That is regression to the mean. If someone is exceptional in one thing, we will always predict that person to be a bit less exceptional in that other thing if there is any error.
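To make "more area" concrete, here is a rough sketch that turns a standard score into the probability of being at least that far from the mean. It assumes the standard scores follow a normal distribution, which the transcript does not state, and the values of `r` and `z_x` are hypothetical.

```python
import math


def two_sided_tail(z):
    """P(|Z| >= |z|) for a standard normal Z (normality is an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.5     # hypothetical correlation, with |r| < 1
z_x = 2.0   # hypothetical observation, two standard deviations above the mean
z_y = r * z_x  # predicted standard score of y

p_x = two_sided_tail(z_x)  # about 0.0455
p_y = two_sided_tail(z_y)  # about 0.3173

print(round(p_x, 4), round(p_y, 4))
```

The predicted y sits closer to the mean, so the shaded probability is larger and the prediction is less extreme.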

*Note:*

Here's a detailed solution:

Saying that the value of a given variable $x$ is "extreme" is the same as saying that this value deviates a great deal from the average. How do we score how much the value of a variable deviates from the average? By calculating the z-score:

$$z_x = \frac{x - \bar{x}}{\sigma_x}, \qquad z_y = \frac{y - \bar{y}}{\sigma_y}$$

So the z-score is going to tell how "extreme" the value of a variable $x$ or $y$ is. The standard deviation says how much the data as a whole deviates from the average. If you want to tell how much a single data point deviates from the average, you need the z-score.

We want to know how the "extremeness" of a value of a variable $x$ is related to the "extremeness" of a value of a variable $y$ (the prediction), or, saying it in a different way, how their z-scores are related:

$$z_y = b\,z_x + a$$

We have derived the relationship between the z-scores of two variables $x$ and $y$ in PS6-7, so we know that $a = 0$ and $b = r$. So the first equation simplifies to:

$$z_y = r\,z_x$$

Applying the modulus (absolute value) to both sides of the equation:

$$|z_y| = |r|\,|z_x|$$

It's given that $|r| \neq 1$, which is equivalent to $|r| < 1$, since the range of $r$ is $[-1, 1]$, so (for any $x$ that is not exactly at the mean):

$$|z_y| < |z_x|$$

Therefore, given $x$ and its prediction $y$, the value of $y$ is always less extreme than the value of $x$.
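The relationship can be checked point by point on the small data set used earlier in this problem set (x of 1, 2, 3 and y of 4, 7, 13). This is a sketch only: the variable names are mine, and the population standard deviation is assumed for the z-scores.

```python
import math
import statistics

xs = [1, 2, 3]
ys = [4, 7, 13]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
b = sxy / sxx
a = y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)

# (z-score of x, z-score of the predicted y) for every observation.
pairs = []
for x in xs:
    y_hat = b * x + a
    pairs.append(((x - x_bar) / sd_x, (y_hat - y_bar) / sd_y))

for z_x, z_hat in pairs:
    # The prediction's z-score is r times the z-score of x, so it is never
    # farther from the mean than x is.
    print(round(z_x, 3), round(z_hat, 3), round(r * z_x, 3))
```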