
27.  Hypothesis Test 2

These are draft notes extracted from subtitles.


01 Height Test Example

We now come to hypothesis test unit number 2, where we just put together what we've learned in the previous hypothesis test unit and in the unit on confidence intervals. This ends up being the dominant way of doing hypothesis testing. We arrive at a very simple formula that we can even program.

According to Wikipedia, in the 2007-2008 season the average NBA basketball player was 6 feet and 6.98 inches tall. That is the same as 200.66 cm or, for the sake of this exercise, just 200 cm. Let me question this and see if this is actually acceptable. I go to a training game for NBA players. Here they are training, and I drag all of the players to the side and measure their height. These are the heights I've observed: I got everything from 199 cm to 206 cm in 1 cm increments. And now I want to ask the two-sided hypothesis test question: should I reject the claim that the average player in the NBA is 200 cm tall at a 95% confidence level?

The trick goes as follows. You learned about confidence intervals, which are the mean plus/minus a factor taken from the t-table times the square root of the empirical variance over N. That formula you've now seen many times in practice. When I plug this in here with the appropriate (a), I get 202.5 ± 1.916. If you want to check this, I'm using 2.365 for (a). Graphically, what this means is: if here is the 200 and here are my data points, the mean of the sample sits over here, and the lower end of the 95% confidence interval stops at 200.58, just short of reaching 200. That means we reject the hypothesis of 200 cm based on our sample of 8 people, because the confidence interval doesn't quite make it here.

Now, I should caution that this isn't quite correct in practice. Depending on what town I go to, I might find a different sample. We didn't pick these people completely independently, and that isn't accounted for here. That's just a word of caution, but I think you get the principle.

It's really simple. Use your confidence interval and check whether the hypothesized value lies within or outside the confidence interval.
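The height example above can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, the variance is computed with 1/N as in these notes, and the factor 2.365 is hard-coded from the t-table (two-sided, 95%, 7 degrees of freedom) rather than looked up by a library.

```python
from math import sqrt

# Heights observed in the sample (cm): 199 to 206 in 1 cm steps.
heights = list(range(199, 207))
n = len(heights)

mean = sum(heights) / n
# Empirical variance with 1/N, matching the formula used in these notes.
variance = sum((x - mean) ** 2 for x in heights) / n

# Two-sided 95% t-table value for N - 1 = 7 degrees of freedom (hard-coded).
t_factor = 2.365
margin = t_factor * sqrt(variance / n)

lower, upper = mean - margin, mean + margin
claimed_mean = 200.0

# Reject the claim when it falls outside the confidence interval.
reject = not (lower <= claimed_mean <= upper)

print(f"{mean:.1f} ± {margin:.3f} -> [{lower:.2f}, {upper:.2f}], reject: {reject}")
```

The check on the last lines is the whole test: the claimed mean is rejected exactly when it lies outside the interval.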


02 Club Age 1

Let me illustrate this again. For any sample, you know how to compute a confidence interval. If the null hypothesis is inside the confidence interval of the observed sample, everything is okay and you keep believing the null hypothesis. Conversely, if the null hypothesis falls outside, you don't believe the null hypothesis and you accept the alternative hypothesis. Hypothesis testing is that simple: you use confidence intervals instead of, for example, the binomial distribution, and that's really the common case.

Let me just briefly summarize what we're doing. Given a sample of data, you compute the mean, you compute the variance, and you get the t-value at some desired error probability p. You have to make sure you pick the correct one: one value if the test is one-sided, a different one if it's two-sided. Then the plus/minus term in the confidence interval is simply this number times the square root of the empirical variance over N. This would have been really cryptic if I had just given it to you, but by now it should make a lot of sense because we practiced every element of it: the mean, the empirical variance, getting a number from a table, and then computing the size of the confidence interval, which we then put into the plus/minus term. So µ minus this term is the lower bound and µ plus this term is the upper bound of the confidence interval, according to the confidence level specified by the error probability p.

Let's put this into action. A dance club operator advertises that the average age of its clients is 26. You walk in and encounter the following people: four people are 21, six people are 24, seven people are 26, 11 people are 29, and two people are actually 40 years old, for a total of N=30. Now, compute for me whether you should trust the statement that the average age is 26. Obviously, the fact that there are two 40-year-olds is a little bit disturbing. Let's do so in stages, and I want to guide you through this. First, let's calculate the mean.


The instructor made a mistake in the variance formula here: a square is missing (fixed in red on the original writing). The correct variance formula is:

\sigma^2 = \frac{1}{N} \cdot \sum_{i=1}^{N} (X_i-\mu)^2

03 Club Age 1 Solution

And the answer is 26.97, which is what you get when you work it all out.

04 Club Age 2

Next, the variance. You've done this before.


05 Club Age 2 Solution

19.57. That's, of course, rounded.
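The mean and variance for the club data can be checked directly from the (age, count) pairs given in the example. The following is a small sketch with illustrative variable names; it uses the 1/N variance formula from these notes.

```python
# Ages observed in the club, as (age, number of people) pairs from the example.
age_counts = [(21, 4), (24, 6), (26, 7), (29, 11), (40, 2)]

# Total sample size N.
n = sum(count for _, count in age_counts)

# Weighted mean: each age contributes once per person of that age.
mean = sum(age * count for age, count in age_counts) / n

# Empirical variance with 1/N, as in the formula used in these notes.
variance = sum(count * (age - mean) ** 2 for age, count in age_counts) / n

print(n, round(mean, 2), round(variance, 2))
```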

06 Club Age 3

What value will we be using for a two-sided 95% confidence interval? This would come from the t-table, but N is large here, so we can use a Gaussian approximation.


07 Club Age 3 Solution

Well, for 95%, we know the number: it's 1.96. We've seen it many times.
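If you'd rather compute this factor than remember it, Python's standard library can invert the Gaussian distribution. This is a sketch under the Gaussian approximation used here; the variable names are illustrative. It also shows why the one-sided and two-sided cases need different values.

```python
from statistics import NormalDist

confidence = 0.95
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided: split the 5% error probability evenly between both tails,
# so we need the 97.5% quantile.
z_two_sided = standard_normal.inv_cdf(1 - (1 - confidence) / 2)

# One-sided: the whole 5% error probability sits in a single tail,
# so we need the 95% quantile.
z_one_sided = standard_normal.inv_cdf(confidence)

print(round(z_two_sided, 2), round(z_one_sided, 3))
```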

08 Club Age 4

Now finally, what is the plus/minus term in the resulting confidence interval?


09 Club Age 4 Solution

I get 1.58, so we have 26.97 ± 1.58. We find that the claimed value of 26 is well within this confidence interval. Based on the sample that we've drawn, there is no reason to suspect that the claim is wrong.

Now again, just pay attention here. On any given night there might be a reason that particularly young or particularly old people come, maybe depending on the style of music played. Just taking a sample on one night is not fair as a statistician. The sample should really be independent, so you should go in on random nights and take a random person each night. But leaving this aside, I hope you were able to follow this. If you were, you're now able to test a hypothesis based on data and accept or reject it. This is a very powerful technique. In the next and final exercise, you get to program all this.
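As a warm-up for that exercise, the whole procedure can be wrapped into one function. This is only a sketch, not the course's reference solution: the function name `confidence_interval_test` and its parameters are made up for illustration, it uses the 1/N variance from these notes, and it assumes the Gaussian approximation, so it is only appropriate for reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval_test(data, claimed_mean, confidence=0.95):
    """Two-sided test: return (mean, margin, reject) for the claimed mean.

    Hypothetical helper for illustration; assumes a large sample so that
    the Gaussian factor can stand in for the t-table value.
    """
    n = len(data)
    mean = sum(data) / n
    # Empirical variance with 1/N, as in these notes.
    variance = sum((x - mean) ** 2 for x in data) / n
    # Two-sided factor from the Gaussian approximation.
    factor = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = factor * sqrt(variance / n)
    # Reject when the claimed mean lies outside mean ± margin.
    reject = abs(claimed_mean - mean) > margin
    return mean, margin, reject


# Dance club sample from the example: 30 people in total.
ages = [21] * 4 + [24] * 6 + [26] * 7 + [29] * 11 + [40] * 2
mean, margin, reject = confidence_interval_test(ages, claimed_mean=26)
print(f"{mean:.2f} ± {margin:.2f}, reject claim: {reject}")
```

For a small sample such as the eight heights in the first example, the Gaussian factor would be too optimistic, and the t-table value should be used instead.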