Problem Set 6: Regression and Correlation

These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!

Contents

01 Regression

Let's consider this set of data. We have our x's and our y's, and y = bx + a. Please tell me what b is. Enter your answer here. Before you start computing, just take a look at the data.

unnamed.jpg

02 Regression Solution

Of course, since y is always 0, b is going to be 0.

03 Influential Observation

Let's say we get 1 new data point. We have 1000 for x and 1000 for y. Now tell me what b should be with this new data. Please enter your answer here.

unnamed (1).jpg

04 Influential Observation Solution

To compute this, recall that b equals the sum of the products of the deviations of the x's from their mean and the y's from their mean, divided by the sum of the squared deviations of the x's from their mean. So, the mean of the x's is 104, and the mean of the y's is 100. Now let's write out the deviations for each term. Now for the differences of the y's. If we add up the products of these, we get a fairly large number--896,000. The denominator is equal to 892,100. Taking the ratio of these two things gives us our answer of 1.004. This is interesting, because it suggests that when we have data like this but add one single point all the way over here, our best fit line goes from being this to being this. Now, note this isn't quite to scale. This point would really be all the way over here.
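The effect of a single influential point can be sketched in a few lines of Python. The original nine x values aren't reproduced in these notes, so x = 1..9 (each with y = 0) is a hypothetical stand-in; the exact slope comes out slightly different from the 1.004 in the lecture, but the behavior is the same.

```python
# Sketch of the influential-observation effect. The x values 1..9 are a
# hypothetical stand-in for the original data (which these notes don't
# reproduce), so the slope differs slightly from the lecture's 1.004.

def slope(xs, ys):
    """Least-squares slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

print(slope(list(range(1, 10)), [0] * 9))                    # 0.0: y is constant
print(slope(list(range(1, 10)) + [1000], [0] * 9 + [1000]))  # close to 1
```

One far-out point dominates both the numerator and the denominator, so the line is pulled almost entirely toward it.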

05 Correlation And Regression

Now let's look at the relationship between correlation and regression as we make changes to our data. Let's consider the following simple data set. We have x of 1, 2, and 3 and y of 4, 7, and 13. Consider the regression coefficient for the equation y = bx + a and the correlation coefficient. Please compute b and r and enter your answers here.

unnamed (2).jpg

06 Correlation And Regression Solution

To answer this we'll apply these two formulas. First we'll compute x-bar and y-bar. X-bar is 2. Y-bar is 8. Now we'll compute our differences: -1, 0, 1 for the x's and -4, -1, and 5 for the y's. The numerator of each of these is 4 + 0 + 5 = 9. The denominator of b is 2, so b is equal to 4.5. Now, summing up the squared differences in the y's, we have 16 + 1 + 25 = 42. The product of these is 84, so we have 9 / √84, which is 0.982.
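The arithmetic above can be checked with a short Python sketch, assuming the same formulas from the lecture (plain sums of deviations):

```python
# Quick check of the worked arithmetic for x = 1, 2, 3 and y = 4, 7, 13.
from math import sqrt

def b_and_r(xs, ys):
    """Regression slope b and correlation r from the lecture's formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

b, r = b_and_r([1, 2, 3], [4, 7, 13])
print(b, r)   # 4.5 and about 0.982
```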

07 Double Y

Now, I wonder what happens to b and r if we double our y's. Fill in your answer here.

unnamed (3).jpg

08 Double Y Solution

Again we'll compute our means. X-bar equals 2. Y-bar is 16. Now we'll take the differences. They're the same for the x's as they were before; for the y's they are -8, -2, and 10. The numerator for each of these is 8 + 0 + 10 = 18. The denominator for b is still 2, so b is 9--interesting, exactly double what it was before. This always holds true: if we double our y's, we always double b. Now let's see what happens to the correlation coefficient. Summing the squared differences in the y's, we have 64 + 4 + 100 = 168. The correlation coefficient is 18 / √336, which is 0.982--exactly the same as it was before.

09 Double Both

Now what happens if we double our x's as well? Please enter your answer in the boxes over here.

unnamed (4).jpg

10 Double Both Solution

I won't bore you with the arithmetic again. It's the same as it was the last two times, and we wind up with the same answers we had the first time. B is equal to 4.5, and r is equal to 0.982. What I hope you've noticed is that r doesn't change when we scale x or y. B doesn't change as long as we scale x and y by the same amount.

11 Slope To Correlation

I'd like to challenge you to find an algebraic relationship between b and r. Can you tell me what b / r is? Is it 1 / Σ(y - y-bar)²? Is it Σ(x - x-bar)² / Σ(y - y-bar)²? Is it the square root of Σ(x - x-bar)² / Σ(y - y-bar)²? Or perhaps is it the square root of Σ(y - y-bar)² / Σ(x - x-bar)²? Choose the right answer.

unnamed (5).jpg

12 Slope To Correlation Solution

The answer is this: b / r is equal to the square root of the sum of the squared deviations in the y's over the sum of the squared deviations in the x's. Note that this is exactly the same as the standard deviation of y divided by the standard deviation of x, because the number of observations divides out. To see that this is, in fact, true, when we divide b by r the numerators cancel, so we have b/r equal to the square root of the sum of the squared differences in the x's times the sum of the squared differences in the y's, over the sum of the squared differences in the x's. The key thing to recognize is that we can break this into two pieces: the denominator can be written as the square of its square root. Now we can just cancel and we have our answer.
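A numerical check of the identity on the data from this problem set, assuming the lecture's formulas for b and r:

```python
# Check that b / r equals sqrt(sum((y - y_bar)^2) / sum((x - x_bar)^2)),
# i.e. the standard deviation of y over the standard deviation of x.
from math import sqrt

xs, ys = [1, 2, 3], [4, 7, 13]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx
r = sxy / sqrt(sxx * syy)

print(b / r, sqrt(syy / sxx))   # both sides of the identity agree
```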

13 Standard Score Regression

Let's now consider linear regressions of standard scores. Recall that the standard score for x, zₓ, is equal to xᵢ - x-bar divided by the standard deviation of x. Let's assume we standardize both x and y, so we now have a set of zₓ's computed this way and zᵧ's, for the equation zᵧ = b zₓ + a. I'd like you to tell me about the values of a and b. Specifically, tell me what is their minimum value on any given data set and what is the maximum. Please fill in your answers here.

unnamed (6).jpg

14 Standard Score Regression Solution

For a the answer is 0 and 0. A can only ever be 0. To see this, recall that a is equal to y-bar - b · x-bar, but for standard scores, zₓ always has a mean of 0 and zᵧ has a mean of 0, because we subtracted out their means. A must always be 0. B must always be between -1 and 1. To see this, recall that, as you just discovered, b is equal to r times the standard deviation of y over the standard deviation of x. Since these standard deviations are each 1, their ratio is 1, so b must have the same range as r, from -1 to 1.


Note:
Here's a detailed solution:

Let's review the equations we have learned for linear regression first:

That's the linear regression we want to accomplish:

y=b \cdot x+a

That's how we calculate the parameters a and b from the samples of x and y:

b = \frac{\sum_{i=1}^n (x_{i}-\overline{x})\cdot(y_{i}-\overline{y})}{\sum_{i=1}^n (x_{i}-\overline{x})^2}

a = \overline{y} - b \cdot \overline{x}

Where \overline{x} and \overline{y} are respectively the means of x and y.

That's the correlation:

r = \frac{\sum_{i=1}^n (x_{i}-\overline{x})\cdot(y_{i}-\overline{y})}{\sqrt{\sum_{i=1}^n (x_{i}-\overline{x})^2\cdot\sum_{i=1}^n (y_{i}-\overline{y})^2}}

We will also need the relationship between b and r that we have derived on PS6-6:

b = r \cdot \frac{\sigma_{y}}{\sigma_{x}}

These are the equations for the z-scores of x and y

z_{x}=\frac{x-\overline{x}}{\sigma_{x}}

z_{y}=\frac{y-\overline{y}}{\sigma_{y}}

z_{y}=b \cdot z_{x}+a

We want to obtain the correlation between z_{x} and z_{y}, in the same fashion we did for x and y in the lectures. That's exactly the same method, but instead of doing linear regression for x and y, we are going to do this for their z-scores. The z scores can be treated as new random variables, which happen to depend on x and y.

Let's first calculate the means. The mean of a sum is equal to the sum of the means, so:

z_{x}=\frac{x-\overline{x}}{\sigma_{x}}=\frac{x}{\sigma_{x}} - \frac{\overline{x}}{\sigma_{x}}

\overline{z_{x}}=\overline{\frac{x}{\sigma_{x}}} - \overline{\frac{\overline{x}}{\sigma_{x}}}

\frac{\overline{x}}{\sigma_{x}} is a constant, therefore its mean is equal to itself.

The mean of a constant times a variable is equal to the constant times the mean of the variable, therefore:

\overline{(\frac{x}{\sigma_{x}})}=\frac{\overline{x}}{\sigma_{x}}

Resulting in:

\overline{z_{x}}=\frac{\overline{x}}{\sigma_{x}} - \frac{\overline{x}}{\sigma_{x}} = 0

The process for \overline{z_{y}} is analogous to \overline{z_{x}} so, if you repeat it for \overline{z_{y}}, you're going to find out that:

\overline{z_{y}}=0

We know that:

\overline{z_{y}} = b \cdot \overline{z_{x}} + a

Therefore:

a = \overline{z_{y}} - b \cdot \overline{z_{x}}

But since both means are zero:

a = 0

Therefore:

The range of a is (0, 0).

Let's now calculate b. We know that (from PS6-6):

b = r \cdot \frac{\sigma_{z_{y}}}{\sigma_{z_{x}}}

Let's calculate the standard deviation of z_{y} and z_{x}

z_{x}=\frac{x-\overline{x}}{\sigma_{x}}

z_{x}=\frac{x}{\sigma_{x}} - \frac{\overline{x}}{\sigma_{x}}

The first term is x divided by a constant and the second term is a constant.

If you add a constant to a variable or subtract a constant from a variable, this does not change its standard deviation, therefore:

stdev(z_{x})=stdev(\frac{x}{\sigma_{x}})

The standard deviation of a constant times a variable is the constant times the standard deviation of the variable, therefore:

stdev(z_{x})=\frac{stdev(x)}{\sigma_{x}}

Resulting in:

\sigma_{z_{x}}=\frac{\sigma_{x}}{\sigma_{x}} = 1

In an analogous way for z_{y}:

\sigma_{z_{y}}=1

There's an important insight here: The mean of the z score of a random variable is always zero and the standard deviation of its z score is always one.

Since b = r \cdot \frac{\sigma_{z_{y}}}{\sigma_{z_{x}}}:

b = r

b is the same as the correlation. Since the range of the correlation is (-1, 1), that's true for b as well, therefore:

The range of b is (-1, 1).
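The note's conclusion can be checked numerically on the data set used earlier in this problem set: after standardizing both variables, the intercept comes out 0 and the slope equals r.

```python
# Standardize x and y, then run the regression on the z-scores.
# Expect: intercept a = 0 and slope b = r (the original correlation).
from math import sqrt

xs, ys = [1, 2, 3], [4, 7, 13]
n = len(xs)

def standardize(vs):
    """z-scores using the population standard deviation."""
    mean = sum(vs) / n
    sd = sqrt(sum((v - mean) ** 2 for v in vs) / n)
    return [(v - mean) / sd for v in vs]

zx, zy = standardize(xs), standardize(ys)

# The z-score means are 0, so the deviation sums reduce to plain sums.
sxy = sum(x * y for x, y in zip(zx, zy))
b = sxy / sum(x * x for x in zx)
a = sum(zy) / n - b * sum(zx) / n
r = sxy / sqrt(sum(x * x for x in zx) * sum(y * y for y in zy))

print(a, b, r)   # a is 0 (up to rounding), and b equals r
```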


15 Why Regression

Now I'd like to ask you a tricky question about regression that will illustrate why regression got its name. Why is this thing that gets us the best fit of a line called regression? What does that have to do with regressing--going backwards? I'd like you to tell me, for any two variables x and y, if they're not perfectly related--that is, if the absolute value of their correlation is not 1--how extreme is our y prediction compared to x? By extreme, to be a little more precise, I mean this. We have x. We have some observation in the x's--say right here. We have a total probability, denoted by this little shaded region, that a given xᵢ is either at that value or farther away from the mean, or here farther away in this direction. We can map that to a y using our regression equation, and we'll get a value in the y distribution, and it will have, again, some probability of being that value or more extreme. The smallness of that probability is what we mean by extremity. A one percent chance is more extreme than a 40% chance. One in a million is more extreme than 1 in 1,000. What I'd like you to tell me is: will y be more extreme? Will y be as extreme? Will y be less extreme? Or does it depend on the data? Please check the right answer.

unnamed (7).jpg

16 Why Regression Solution

Y will always be less extreme. The key to understanding this simply is this equation. This is what tells us that the standard scores have a regression coefficient equal to r, whose absolute value is at most 1. Since the correlation here is not 1 in absolute value, it must be less than that. That means the standard score shrinks. If the standard score shrinks, that means if it's negative it moves this way, and if it's positive it moves this way. That means there is more area under here, so this has to be a little more area than this. Therefore, y will always be less extreme. This is the concept of regression--that is, regression to the mean. If someone is exceptional in one thing, we will always predict that person to be a bit less exceptional in the other thing if there is any error.


Note:
Here's a detailed solution:

Saying that the value of a given variable x is "extreme" is the same as saying that this value deviates a great deal from the average. How do we score how much the value of a variable x deviates from the average? By calculating the z-score:

z_{x} = \frac{(x_{i}-\mu_{x})}{\sigma_{x}}

z_{y} = \frac{(y_{i}-\mu_{y})}{\sigma_{y}}

So the z score is going to tell how "extreme" is the value of a variable x or y. The standard deviation says how much the data as a whole deviates from the average. If you want to tell how much a single data point deviates from the average, you need the z-score.

We want to know how the "extremeness" of a value of a variable x is related to the "extremeness" of a value of a variable y (prediction), or, saying it in a different way, how their z-scores are related:

z_{y}=b \cdot z_{x}+a

We have derived the relationship between the z score of two variables x and y in PS6-7, so we know that a = 0 and b = r

So the first equation simplifies to:

z_{y}=r \cdot z_{x}

Applying the modulus (absolute value) to both sides of the equation:

|z_{y}|=|r| \cdot |z_{x}|

|r| = \frac{|z_{y}|}{|z_{x}|}

It's given that |r| \neq 1, which is equivalent to |r| < 1, since the range of r is (-1, 1), so:

|z_{y}| < |z_{x}|

Therefore, given x and its prediction y, the value of y is always less extreme than the value of x.
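Regression to the mean can be illustrated on the small data set from earlier in this problem set. The prediction in standard-score units just multiplies zₓ by r, and since |r| < 1 here, the predicted zᵧ is always closer to 0 than the zₓ it came from:

```python
# Regression to the mean on x = 1, 2, 3 and y = 4, 7, 13: the predicted
# standardized y is less extreme than the standardized x, because the
# prediction is z_y = r * z_x with |r| < 1.
from math import sqrt

xs, ys = [1, 2, 3], [4, 7, 13]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / sqrt(sxx * syy)

sd_x = sqrt(sxx / n)              # population standard deviation of x
z_x = (3 - x_bar) / sd_x          # standard score of the largest x
z_y_pred = r * z_x                # predicted standard score of y

print(z_x, z_y_pred)              # the prediction is closer to 0
```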