Problem Set 5: Inference


These are draft notes extracted from subtitles.

01 Poll Error

In opinion polls, we often report a margin of error. This margin of error is the same as the confidence interval width used in this course, that is, the plus-or-minus amount around the estimate. Usually, the margin of error in polls is stated at the 95% level. Consider two candidates. In our poll of 100 people, candidate A has 55 supporters. Candidate B has only 45 supporters. Please tell me what we would report as the results of the poll: the percentage of support for candidate A, plus or minus the margin of error at 95% confidence.

02 Poll Error Solution

Of course, the percentage support for candidate A is 55%. To compute the margin of error, we use the following formula: 1.96, the quantile of the normal distribution at 97.5%, times the square root of p times (1 - p) divided by n. Substituting p = 0.55 and n = 100 and evaluating this expression, 1.96 × √(0.55 × 0.45 / 100) is equal to 9.75%.
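
The calculation above can be sketched in a few lines of Python. This is only an illustration using the normal approximation from the lecture; the function name `margin_of_error` is made up for this example.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Plus-or-minus amount for a proportion p observed in a sample of size n.

    z = 1.96 is the 97.5% quantile of the standard normal (two-sided 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Poll from the problem: 55 of 100 respondents support candidate A.
moe = margin_of_error(0.55, 100)
print(f"55% +/- {moe:.2%}")  # prints "55% +/- 9.75%"
```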

03 Sample Size

Let's assume we want a lower margin of error, say 5%. How many people should we ask?

04 Sample Size Solution

Now, you might say that the answer will depend in part on what we see the response as being, but to simplify this, let's assume that 55% will still support candidate A. That won't quite be right, but it lets us answer the question. To do so, we need to solve the margin of error formula for n. Let's call the margin of error e, so e = 1.96 × √(0.55 × 0.45 / n). We can simplify this. If we square and divide, we get e²/1.96² = 0.55 × 0.45 / n, where the right-hand side is the old variance. We know e: we specified that it is going to be 0.05. If we multiply through by n and solve, n = 0.55 × 0.45 × 1.96² / 0.05², and if we calculate this out we get 380.3. Since it's pretty hard to ask a fraction of a person, we have to ask a bit more. We have to ask 381 people.
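
As a rough check of the algebra, here is a small Python sketch that solves the formula for n and rounds up. The helper name `required_sample_size` is hypothetical, and p = 0.55 is the simplifying assumption made above.

```python
import math

def required_sample_size(p, e, z=1.96):
    """Solve e = z * sqrt(p * (1 - p) / n) for n.

    Returns the exact (fractional) value and the value rounded up,
    since we cannot poll a fraction of a person.
    """
    exact = p * (1 - p) * z**2 / e**2
    return exact, math.ceil(exact)

exact, n = required_sample_size(p=0.55, e=0.05)
print(f"exact n = {exact:.1f}, people to ask = {n}")
```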

05 Sensitivity

In answering how many people we actually need to poll, we made an assumption. The assumption, more or less, is that small changes in p don't matter, so we don't have to worry about getting it exactly right if we're trying to estimate a sample size. To prove this to ourselves, let's redo this calculation with p = 0.5 and p = 0.6. Please enter your values for n here.

06 Sensitivity Solution

Looking at our solution, we can see that the only place p matters is the p(1 - p) term, which flows right through to the final expression. So we only need to change these two numbers. For p = 0.5 this is going to be 0.5 × 0.5 × 1.96² / 0.05², and for p = 0.6 it will simply be 0.6 × 0.4 times the same factor. For p = 0.5, this evaluates to 384.16, and for p = 0.6 it evaluates to 368.79. Since again we can't have fractional people, the answer for p = 0.5 is 385 and the answer for p = 0.6 is 369. A quick way to see why this changed so little is that p(1 - p) is nearly the same in both cases: 0.5 × 0.5 is 0.25 and 0.6 × 0.4 is 0.24.
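
A short Python sketch of the sensitivity check, under the same normal-approximation formula. The variable names are illustrative only.

```python
import math

Z = 1.96   # two-sided 95% quantile of the standard normal
E = 0.05   # target margin of error

results = {}
for p in (0.5, 0.55, 0.6):
    variance = p * (1 - p)            # the only term that depends on p
    exact = variance * Z**2 / E**2    # fractional sample size
    results[p] = (variance, exact, math.ceil(exact))
    print(f"p = {p}: p(1-p) = {variance:.4f}, n = {exact:.2f}, ask {math.ceil(exact)}")
```

The p(1 - p) column barely moves between 0.24 and 0.25, which is why the required sample size changes by only a few people.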

07 More Error

Often we find in reality that the margin of error is wrong. Which of these reasons could make us understate the true margin of error? If the poll is taken before the election, at a different time, does that change the margin of error? If some voters are under-represented in our sample, could that increase error? Similarly, if some in our sample are less likely to vote than others, does that affect our error? Here, assume that we haven't somehow perfectly matched up the frequency of voting and their relative representation in the sample. And if the responses differ from what someone actually intends to do, could that cause more error?

08 More Error Solution

Each of these things can cause more error. The answer is that all of them should be marked. The poll being taken before the election can certainly affect its accuracy, because sentiments shift over time. Some voters being under-represented in the sample matters too: certainly, if voters who favor one candidate over another are under- or over-represented, that will change the results. Similarly, if people who are more likely to vote are represented the same as those who are less likely to vote, the likely voters are not being over-represented enough. So these two things ideally should be linked, and a well-designed poll tries to do this, although it's hard to do perfectly. And responses that differ from what people actually do will of course cause additional error.

09 Weight Loss

Let's assume we have two different ways to lose weight, and we want to figure out which one is the most effective. We have 10,000 people who received Treatment A. Their average loss is 10 pounds. The standard deviation of their loss is also 10 pounds. Let's consider a second treatment, Treatment B. We also applied it to 10,000 people. Our average loss in this case was 20 pounds, and we also have a standard deviation of 20 pounds. Our null hypothesis, of course, is that the average weight loss from Treatment A equals the average weight loss from Treatment B. Our alternative hypothesis, if we are perhaps the providers of Treatment B, is that WB is greater than WA or, equivalently, that the difference is positive. With a 5% allowable false positive rate, which hypothesis do we choose to accept?


Note:
What does "false positive rate" mean? Suppose the null hypothesis is true and you do repeated experiments (sample from a population and test your hypothesis at the 95% confidence level). About 95% of your experiments should fail to reject the null hypothesis, and about 5% should reject it even though it is true. Those 5% are the false positives.

When he says "5% false positive rate", he makes this statement in regard to the alternate hypothesis (w_{B} > w_{A}), assuming that the null hypothesis (w_{B} = w_{A}) is true. If you assume that the null hypothesis is true, any experiment that confirms the alternate hypothesis is a false positive (the experiment confirms the alternate hypothesis, therefore positive, but you know that isn't true, therefore false). In short, saying that you want a "5% false positive rate" is the same as saying that you want a "95% confidence level".
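
To make the idea concrete, here is a small simulation sketch in Python. Everything in it is an assumption for illustration and not from the lecture: both groups are drawn from the same normal distribution (so the null hypothesis is true by construction), the standard deviation is treated as known, and the sample size, number of trials, and seed are arbitrary.

```python
import math
import random

def false_positive_rate(trials=2000, n=50, sigma=1.0, z_crit=1.645, seed=0):
    """Fraction of experiments that wrongly accept w_B > w_A when w_B = w_A."""
    rng = random.Random(seed)
    # Standard deviation of the difference of two sample means.
    se = math.sqrt(2 * sigma**2 / n)
    rejections = 0
    for _ in range(trials):
        mean_a = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        mean_b = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        # One-sided test: reject the null only if B looks better than A.
        if (mean_b - mean_a) / se > z_crit:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(f"observed false positive rate: {rate:.3f}")
```

The observed rate should land close to 0.05, with some run-to-run noise from the finite number of trials.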


10 Weight Loss Solution

To answer this, our variable of interest is the difference of these two averages. It has a mean of 10 pounds, and its variance is going to be the sum of the variance of one sample mean and the variance of the other. Now we need to find the standard deviation of this difference. Since variances add, let's work with those. The variance of the Treatment B mean is 400/10,000 and the variance of the Treatment A mean is 100/10,000. If we add these and take the square root, this gives us the standard deviation of the difference, and it's 0.2236. So we can look at the ratio of the mean difference to this standard deviation: 10/0.2236 is about 44.7. This is much, much greater than the critical value of 1.645. And so in this case we reject the null hypothesis and therefore accept the alternative. There's one particularly interesting feature of these two treatments as they relate to each other. Just consider how many people lost weight in each case, and post your thoughts to the forums.
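
A minimal Python sketch of this test statistic, assuming the two groups are independent so that the variances of the sample means add. The function name `z_for_difference` is made up for this example.

```python
import math

def z_for_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (z, se) for the difference mean_b - mean_a of two sample means."""
    # Variance of each sample mean is sd^2 / n; the variances add.
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / se, se

z, se = z_for_difference(mean_a=10, sd_a=10, n_a=10_000,
                         mean_b=20, sd_b=20, n_b=10_000)
print(f"standard deviation of the difference = {se:.4f}")
print(f"z = {z:.1f}, reject H0 at the one-sided 5% level: {z > 1.645}")
```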


Note:
From the lecture "Manipulating Normals", you know that you can combine the random variables w_B and w_A and create a new variable (w_B-w_A). More specifically, from the lecture "Subtracting Normals", you know that you need to subtract the averages and add up the variances:

\sigma_{B-A}^2 = \sigma_{B}^2 + \sigma_{A}^2

\mu_{B-A} = \mu_{B}-\mu_{A}

Since we are working with the mean over the sample:

\sigma_{B}^2 = \frac{20^2}{10000}

\sigma_{A}^2 = \frac{10^2}{10000}

The variance of the mean over the sample is equal to \frac{\sigma^2}{N} (or the standard deviation is equal to \frac{\sigma}{\sqrt{N}}), as we have learned in the lectures from "Confidence Interval". Therefore:

\sigma_{B-A}^2 = \frac{20^2}{10000} + \frac{10^2}{10000} = \frac{400}{10000}+\frac{100}{10000}

\mu_{B-A} = 20 - 10 = 10

You also know that the confidence interval is given by the following formula:

CI = a \cdot \sqrt{\frac{\sigma^2}{N}}

Where \sqrt{\frac{\sigma^2}{N}} is the standard deviation of the mean over the sample (or \frac{\sigma^2}{N} is the variance of the mean over the sample) and "a" is the "magic number".

The magic number marks the 95% confidence level. If "a" is greater than 1.645, you know that you're outside the confidence interval. (1.96 is the two-tailed 95% value for large N. Since the alternate hypothesis here is one-sided, the TA used the T-Table to get "a" for 95% one-tail and large N, which is 1.645.)

You have a \mu_{B-A} = 10 and you would like to know if the mean could be equal to zero (w_B = w_A). To include zero inside your confidence interval, you would need a confidence interval of 10 lb, therefore:

CI = 10

Manipulating the confidence interval formula:

a = \frac{CI}{\sqrt{\frac{\sigma^2}{N}}}

a = \frac{CI}{\sqrt{\sigma_{B-A}^2}}

Substituting the values:

a = \frac{10}{\sqrt{\frac{400}{10000} + \frac{100}{10000}}} \gt 1.645

Therefore H_0 (w_B = w_A) does not hold.

A different way to solve the problem:

If you go to the T-table first, and get a = 1.645, you can calculate the CI, since you have the mean and variance for the new combined distribution:

CI = 1.645 \cdot \sqrt{\frac{400}{10000} + \frac{100}{10000}} = 0.3678331822987154

Which results in the following interval: 10 \pm 0.3678331822987154

So the confidence interval goes from 9.632166817701284 to 10.3678331822987154, which clearly excludes zero. Therefore, H_0 (w_B = w_A) does not hold. I prefer this way, since it is closer to the instructor's explanation and easier to understand.
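
The interval approach can be sketched in Python as follows. As an alternative to reading the table, the sketch also pulls the large-sample "magic numbers" from the standard library's `statistics.NormalDist`; using the normal distribution in place of the T-table is an assumption that is reasonable here only because N is large.

```python
import math
from statistics import NormalDist

# Large-sample critical values from the standard normal distribution.
z_one_tailed = NormalDist().inv_cdf(0.95)    # about 1.645
z_two_tailed = NormalDist().inv_cdf(0.975)   # about 1.96

# Standard deviation of the difference of the two sample means.
se = math.sqrt(20**2 / 10_000 + 10**2 / 10_000)

# Same numbers as in the note: a = 1.645 around the observed difference of 10.
margin = 1.645 * se
low, high = 10 - margin, 10 + margin
print(f"one-tailed a = {z_one_tailed:.3f}, two-tailed a = {z_two_tailed:.3f}")
print(f"interval: {low:.4f} to {high:.4f}, contains zero: {low <= 0 <= high}")
```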


11 Hypothesis Tests

Now, I'd like to give you a number of situations where you might be able to use hypothesis tests. In which of these situations would it be helpful to use a hypothesis test? Comparing crash rates of two different airlines with 100 flights each. In case you don't know, but I assume most of you do, planes don't crash very often. Comparing five-year auto repair rates on 100-vehicle samples. Comparing the heights of the tallest buildings in two different cities. And determining if a course improved test scores in some sample of students.

12 Hypothesis Tests Solution

And I would say the answers are comparing auto repair rates, since those are simply proportions that can be compared to each other as we've done for many things, and determining if a course improved test scores. We would argue that we can't compare crash rates of two airlines with 100 flights each because, in general, there aren't crashes. Now, in theory, you could use a hypothesis test if you recognize that the data are not normally distributed, do the appropriate binomial test, and figure out that you can't determine anything, but that generally adds nothing over intuition. And comparing the heights of the tallest buildings in two cities: that's a deterministic number. It's either taller or it's not. There are no random errors, so there's no need for a hypothesis test. Again, a more creative application could be to use it on the measurements of the buildings in each of the two cities to determine whether one was taller, but that's only going to be useful if the measurement error is on the same order as the difference between the buildings, which is unusual. So in practice you would not use a hypothesis test there.

13 Large Sample Limit

As the sample size approaches infinity, what happens to the confidence interval width? As the sample size approaches infinity, the confidence interval width approaches: 0, or some constant that varies based on the problem.

14 Large Sample Limit Solution

The answer is 0. Remember that the width of the confidence interval is proportional to 1/√n. As n goes to infinity, √n goes to infinity too, and any fixed number divided by something that grows without bound goes to 0.
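
A quick numerical illustration in Python, reusing the poll proportion p = 0.55 from earlier as an arbitrary example. The function name `ci_half_width` and the sample sizes are chosen only for this sketch.

```python
import math

def ci_half_width(p, n, z=1.96):
    """Plus-or-minus width of the 95% interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

sizes = [100, 10_000, 1_000_000, 100_000_000]
widths = [ci_half_width(0.55, n) for n in sizes]
for n, w in zip(sizes, widths):
    print(f"n = {n:>11,}: +/- {w:.4%}")
```

Each hundredfold increase in n shrinks the width by a factor of ten, so the width heads toward 0 as n grows.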

15 Large Sample Test

Consider two marketing campaigns. Campaign A has a response rate of 1%. And we have a second marketing campaign, Campaign B, that has a response rate of 1.01%. For each of these, we have 10 million data points. At 95% confidence, is this difference significant?

16 Large Sample Test Solution

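And the answer is yes. You might guess this just from the number of observations, but let's go through the math. We have the difference of 0.01%, or 0.0001, and we divide that by the standard deviation of the difference, the square root of the total variance, which is √(0.01 × 0.99 / 10,000,000 + 0.0101 × 0.9899 / 10,000,000). This ratio is equal to 2.24, which is greater than 1.96. This is significant.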


Note:
The question is if the difference of 0.01% between campaign A and B is significant, i.e., if their \mu's (1% for campaign A and 1.01% for campaign B) could be the same. This statement is equivalent to the following:

H_0: \quad \mu_{A} = \mu_{B}

H_1: \quad \mu_{A} \neq \mu_{B}

Which is the same as:

H_0: \quad \mu_{B} - \mu_{A} = 0

H_1: \quad \mu_{B} - \mu_{A} \neq 0

That defines your hypothesis test.

Remember that "a" ("the magic number") marks the 95% confidence interval. If "a" is greater than 1.96, you know that you're outside the confidence interval (i.e., in the blue areas), and therefore your null hypothesis does not hold.

You have a \mu_{B-A} = 0.01 \% and you would like to know if it could be equal to zero ( \mu_{A} = \mu_{B}). To include zero inside your confidence interval, you would need a confidence interval of 0.01%, therefore:

CI = 0.0001 (That's the same as 0.01 %)

The confidence interval formula is:

CI = a \cdot \sqrt{\frac{\sigma^2}{N}}

Manipulating the formula:

a = \frac{CI}{\sqrt{\frac{\sigma^2}{N}}}

We are combining two distributions A and B (their difference), therefore the variance of the combined distribution is:

\frac{\sigma_{B-A}^2}{N_{B-A}} = \frac{\sigma_{B}^2}{N_{B}}+\frac{\sigma_{A}^2}{N_{A}}

\sigma's are not given for the distributions, but you know that the expected variance can be calculated as follows:

\sigma^2 = p \cdot (1-p)

Substituting:

a = \frac{CI}{\sqrt{\frac{\sigma_{B}^2}{N_{B}}+\frac{\sigma_{A}^2}{N_{A}}}}

a = \frac{CI}{\sqrt{\frac{p_{B}\cdot(1-p_{B})}{N_{B}}+\frac{p_{A}\cdot(1-p_{A})}{N_{A}}}}

Substituting the values:

a = \frac{0.0001}{\sqrt{\frac{0.0101\cdot(1-0.0101)}{10000000}+\frac{0.01\cdot(1-0.01)}{10000000}}} = 2.241792417319815 > 1.96

Since zero is outside our confidence interval of 95%, the null hypothesis is rejected (H0: \mu_{A} = \mu_{B}) and therefore the difference is significant.
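
The same computation as a Python sketch, assuming the two campaigns are independent and using p(1 - p) as the variance of each response. The function name `z_for_proportions` is hypothetical.

```python
import math

def z_for_proportions(p_a, n_a, p_b, n_b):
    """Magic number 'a' for the difference p_b - p_a of two response rates."""
    # Variance of each estimated rate is p * (1 - p) / N; the variances add.
    se = math.sqrt(p_b * (1 - p_b) / n_b + p_a * (1 - p_a) / n_a)
    return (p_b - p_a) / se

a = z_for_proportions(p_a=0.01, n_a=10_000_000, p_b=0.0101, n_b=10_000_000)
print(f"a = {a:.4f}, significant at the two-sided 95% level: {a > 1.96}")
```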


T-Distribution Table
t_dist.gif

Source: http://mips.stanford.edu/courses/stats_data_analsys/principles/t_table.html