These are draft notes extracted from subtitles.
Welcome back to Statistics 101. Today, we do a case study. The case study is based on data that you, the students, submitted: on Facebook you all discussed my weight, and you've all seen me in this class. A whopping 163 of you followed this very hidden Facebook discussion and submitted your guesses. Thanks to all of you, I now have a data set. In my first study, I will treat your guesses as my data, or sample, and, somewhat abusing the nature of my own weight, I treat my own weight as the hypothesis H0. Technically speaking, this is a bit of an abuse. We're not really testing whether my own weight is incorrect, but you can still use the same math and have a lot of fun. Towards the end of this exercise, you will actually find out my weight. Please don't tell anybody. Here is the actual data that was submitted by all of you. One of the really interesting findings here: some of you had some fun. In the data, you find a guess of 1 × 10²², which is about the weight of the planet Pluto in kilograms. Then there are three counter guesses with -1, which makes me suspect that these numbers were submitted by the same person. Some of you think I weigh 0, whereas others think I weigh 1000 kg, as much as a car. Thank you so much. Very kind of you. Let's take this data, and I'll ask you the first quiz.
We have 163 samples. Remember the outlier removal method using quartiles? How many samples survive?
The answer is 83. If you remember this method, you realize there were four quartiles, and individual elements separate those. The separating elements are called the lower quartile, the median, and the upper quartile, and the two middle groups are the ones being picked. Now, 163 - 3 special elements makes 160, which divided by 4 makes each of these groups 40. We pick the two center groups of 40 plus these three special elements and get 83.
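The counting argument above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the transcript; the function name `quartile_survivors` is made up here, and the sketch assumes the sample size minus the three separating elements divides evenly by 4, as it does for 163.

```python
def quartile_survivors(n: int) -> int:
    """Count the samples kept by the quartile method as described above.

    Assumption: three separating elements (lower quartile, median,
    upper quartile) split the remaining samples into four equal groups.
    """
    special = 3  # lower quartile, median, upper quartile
    group_size, remainder = divmod(n - special, 4)
    if remainder != 0:
        raise ValueError("this sketch only handles n where (n - 3) is divisible by 4")
    # Keep the two middle groups plus the three separating elements.
    return 2 * group_size + special


print(quartile_survivors(163))  # 83
```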
For these 83 elements, I'll tell you that the sum of the values is 6618.47. The sum of the squares is 528,679. Now, you can work this out yourself, but then you have to go through an awful lot of numbers, so I just did this for you. Give me the mean.
The answer is 79.74. The way you get that is to divide 6618.47 by 83. How about the variance? The variance is approximately 11.1. You might remember that we can calculate it as the expectation, or mean, of x² minus the square of the mean of x, which I just did here. Now, give me the standard deviation. That is 3.33, the square root of the variance.
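As a quick check of these three numbers, here is a minimal Python sketch that works only from the sample size, the sum, and the sum of squares quoted above. The function name `summary_from_sums` is hypothetical, and the variance is the population form (mean of x² minus the square of the mean), which is what the transcript describes.

```python
import math


def summary_from_sums(n, total, total_sq):
    """Return (mean, variance, standard deviation) from running sums."""
    mean = total / n
    # Mean of the squares minus the square of the mean.
    variance = total_sq / n - mean ** 2
    return mean, variance, math.sqrt(variance)


mean, variance, std = summary_from_sums(83, 6618.47, 528679.0)
print(round(mean, 2), round(variance, 1), round(std, 2))  # 79.74 11.1 3.33
```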
Now I want the plus/minus term, the two-sided confidence interval at a confidence level of 95%. This is the number we can add to or subtract from the mean, so that at 95% confidence the true mean lies within the interval. Please enter it here.
The answer is 0.72. The way we get there is to take the standard deviation, 3.33, and divide it by the square root of n, which is 83. Then factor in the magic number 1.96, which is the two-sided 95% confidence interval number for the normal distribution. That gives us 0.72.
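The plus/minus term can be reproduced with the short sketch below. It assumes the normal approximation with the factor 1.96 and uses the standard deviation of 3.33 divided by the square root of 83; the function name `margin_of_error` is chosen here for illustration.

```python
import math


def margin_of_error(std, n, z=1.96):
    """Two-sided margin: z times the standard deviation over sqrt(n)."""
    return z * std / math.sqrt(n)


margin = margin_of_error(3.33, 83)
print(round(margin, 2))  # 0.72
```

Note that dividing the variance (11.1) instead of the standard deviation would give roughly 2.4, which does not match the stated answer.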
Here, finally, is the revelation. As it turns out, my weight is obtained if you assume a standard score of 2. Can you give me a good guess what my weight will be if the standard score is 2 for me? I should add that the mean over here is the heavier of the two numbers. So, I don't quite weigh as much as here.
The instructor made a mistake here. Given that "the mean over here is the heavier of the two numbers", the standard score is -2, not 2.
The answer is about 73 kg. 73.086 would be the exact number, but I'm happy with 73.
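Converting a standard score back into a weight is a one-line calculation. The sketch below uses the rounded mean and standard deviation from the earlier answers and a standard score of -2, since the mean is the heavier number; the function name is hypothetical. With rounded inputs the result is 73.08 rather than the quoted 73.086, a gap that presumably comes from rounding.

```python
def value_from_standard_score(mean, std, z):
    """Invert z = (x - mean) / std to recover x."""
    return mean + z * std


weight = value_from_standard_score(79.74, 3.33, -2)
print(round(weight, 2))  # 73.08
```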
Let's do something that isn't quite correct semantically but that we can do anyhow. Suppose the null hypothesis is 73. With a two-sided test using a 95% confidence threshold, would we accept or reject the null hypothesis? Just pick one.
The answer is reject, and you can see it easily from the data shown here. 79.74 is the mean. This is the 95% confidence interval, less than a kilogram in each direction. The reason why it's so small is that we have so many data samples, which is really interesting. As a result, we don't even come close to 73. So, suppose all the guesses that you voiced online were actual measurements using Sebastian's actual scale, and 73 is what I told you. The question H0 really was: is Sebastian telling the truth, or is he lying? Then, using statistical techniques, you should conclude he's lying. This is somewhat contrived. My weight is the only data point, and everything else is just guesses. But I hope you had fun using actual data that you students derived in answering this question.
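The decision rule used here can be sketched as a simple interval check: reject the null value if it falls outside the mean plus or minus the margin. The function name `rejects_null` is made up for this illustration, and the inputs are the rounded figures from the earlier answers.

```python
def rejects_null(sample_mean, margin, null_value):
    """Two-sided test: reject if the null value lies outside the interval."""
    lower = sample_mean - margin
    upper = sample_mean + margin
    return not (lower <= null_value <= upper)


print(rejects_null(79.74, 0.72, 73.0))  # True: 73 is far below 79.02
```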
Welcome back to statistics. Today we do a case study in regression. In this case study, I will again use the data that you all submitted about my current weight, called L, and my previous weight from last year, called P, both guessed in kilograms. Here is my question for you as a well-trained statistician. What is the very first thing I'll do? Compute the mean, the variance, and similar simple statistics? Run a scatterplot? Or go straight to the end and run my regression methods to see what's the best line that fits the data? Pick one.
The best answer is to run scatterplots, and the reason is I taught you that you should always look at the data first before doing anything with the data. So, here is the scatterplot for those data points. It's really interesting. If I look at this, it looks like almost all the guesses are the same. That's the case until I realize that my vertical axis goes from 1e22 all the way down to -1e22. Obviously there are outliers in that data. This plot makes it completely obvious to me that this isn't good data. Let's take this plot away, and let's look at the data. As before, we notice that there are these extreme cases where we have 1E+22, and the same over here.
I have to ask myself how to remove outliers. Let me ask you a question. Can we use the quartiles method? Yes or no? I'm not asking whether there is in principle a variant of the method that can be used, but whether, the way you've learned the method, we can just go through both lists and independently remove the lower and the upper quartile in each. Yes or no?
The answer is no. I mean, of course you can do it, but it'll give you really bad data. The reason is the data is paired. These two guesses correspond to each other. The same person guessed 78 over here and 80 over here. The same is true with the next two guesses. If we just remove the lowest quartiles in the first list and the second list, we'll remove different elements, and the surviving elements don't fit together, and we can't really run regression anymore. We have to keep that correspondence alive. I removed all the data points where either of the two components was a guess under 50. If it said 0 or 1 over here, I'd remove the entire data point in both lists. The same is true for weights over 120. I figure I really don't look like a 200 kg guy or like a 1,000 kg person. I figured 120 is fine, but beyond 120 is really not a reasonable guess. The reason why I use these brackets is I really have to remove data pairs together. That was the easiest way for me to remove outliers. Let's look at the data again.
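The paired thresholding described above might look like the following sketch. The sample pairs are invented for illustration (only the 78 and 80 pair is mentioned in the transcript), the function name is hypothetical, and the bounds are treated as inclusive on the assumption that only guesses strictly under 50 or strictly over 120 are dropped.

```python
def remove_paired_outliers(pairs, low=50.0, high=120.0):
    """Keep a pair only if BOTH guesses lie within [low, high].

    Dropping the whole pair keeps the two lists aligned for regression.
    """
    return [
        (first, second)
        for first, second in pairs
        if low <= first <= high and low <= second <= high
    ]


# Hypothetical guesses; each tuple comes from one person.
guesses = [(78, 80), (82, 79), (0, 75), (1e22, 1e22), (90, 1000), (-1, -1)]
cleaned = remove_paired_outliers(guesses)
print(cleaned)  # [(78, 80), (82, 79)]
```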
Here is that data after outlier removal. I shifted it by subtracting 50 from either dimension. This axis goes from 50 to 120 even though it says 0 to 70. The same is true over here. Now we can see the guesses are more rational. They seem to line up a little bit in linear fashion. They are certainly increasing, which means people who guessed a high weight for me today, on the vertical axis, likely also guessed a high weight for me last year. The guesses aren't entirely random. What I'm going to ask is how the guesses from last year, on the horizontal axis, correlate with the guesses from this year. I'm going to ask you what line best describes this data, and we're going to compute this using the formulas we learned about in class. Here is a quick quiz for you. If you look at that data between axes x and y, I will assume that you can calculate the correlation r, but for now I want you to look at this and tell me your best guess of r. I'll give you a couple of hypotheses: -1, -0.7, 0, +0.7, and +1. None of those is absolutely correct, but one of them is quite close. I think you can figure out which one it is. Please check exactly one of these five boxes.
The answer happens to be 0.7, and the reason is we can see that there is a positive correlation between x and y. It is not 1. That would mean there would be no spread in the data. The points would all line up nicely on the line. But there is spread in this data, so it can't be 1. It's not zero, because that would mean no correlation. It's not negative, because the line doesn't go down. It goes up. So, 0.7 is the best guess for the correlation. It's always good to look at the data and guess it in advance, because then you can see if your math is actually correct. Let's dive in and calculate the actual correlation.
For last year's weight, or the x axis, your mean estimate was 81.2. I'm very flattered to report that for this year, on average after outlier removal, you believe my weight went down. The standard deviation for x was 10.6, and the same for y is 9.34. Finally, I'll give you the mean of the following mixed product: the difference between each x value and x̄, the mean of x, times the same difference for y. This looks an awful lot like a variance, but it's between two different variables. We call this a covariance, or "cov", between two variables x and y. In our example, the numerical value for the covariance is 75.36. Using the covariance, it is now amazingly easy to calculate the correlation r. We just divide it by the two standard deviations for x and y. Please give me that number.
The answer for me is 0.76. It's between -1 and +1. It's positive. We knew there was positive correlation. It's not 1, but it suggests a very strong correlation.
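As a check on this answer, the correlation can be computed directly from the three summary numbers given above. This is a minimal sketch; the function name `correlation_from_summary` is chosen here for illustration.

```python
def correlation_from_summary(cov_xy, std_x, std_y):
    """Correlation r: covariance divided by both standard deviations."""
    return cov_xy / (std_x * std_y)


r = correlation_from_summary(75.36, 10.6, 9.34)
print(round(r, 2))  # 0.76
```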
Now, b, the slope of the regression line, is also easily expressed using the covariance, but now we're normalizing by the variance of x. What do you think that is?
The answer is 0.67. The caveat here was that the variance is the square of the standard deviation. So, you have to divide 75.36 by the square of 10.6, which is just a little over 100, and that gives us 0.67.
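The same summary numbers give b in one step. The sketch below squares the standard deviation of x to get the variance before dividing; the function name is hypothetical.

```python
def slope_from_summary(cov_xy, std_x):
    """Regression slope b: covariance divided by the variance of x."""
    var_x = std_x ** 2  # 10.6 squared is 112.36, a little over 100
    return cov_xy / var_x


b = slope_from_summary(75.36, 10.6)
print(round(b, 2))  # 0.67
```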
a, you might remember, is the mean of y minus b times the mean of x. I'm sure you can compute this for me.
The answer is 25.2. The line is y = 25.2 + 0.67x. This line over here is the linear regression fit to the data. Pictorially, the line looks just like this. It might be hard to see because of my shift of units: this is really 50 and this is 120. When you work this out, this line hits the y axis at 25.2, and its steepness is such that when it goes one step right, it goes 0.67 steps up vertically, which is the slope of the line. We just did the following: we inspected a data sample with two dimensions, and we saw it had massive outliers. We removed the outliers using a very context-dependent method of just thresholding. We calculated some basic statistics, such as the mean, the standard deviation, and the covariance. Then we came to the meat of the unit. We computed the correlation and the regression, and those two things were the primary result of the step, but they would not have been possible without the initial steps of cleaning up the data. I hope we got those right. If you did, you are a very capable statistician at this point. This is a very typical question that arises in statistics, and getting it correct means you have a fairly deep understanding of something really nontrivial that comes up in statistics. Congratulations.
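To tie the steps together, here is a small end-to-end sketch that computes the means, the covariance, r, b, and a from paired data. The five data pairs are invented for illustration and are not the class data, the function name `fit_line` is hypothetical, and the covariance and variances use the population form (divide by n), matching "the mean of the mixed product" in the transcript.

```python
import math


def fit_line(xs, ys):
    """Return (a, b, r) for the line y = a + b * x fitted to paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Mean of the mixed product of deviations: the covariance.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / n
    r = cov / math.sqrt(var_x * var_y)
    b = cov / var_x          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b, r


# Hypothetical cleaned guesses: last year's weight (x), this year's (y).
xs = [70, 75, 80, 85, 90]
ys = [72, 74, 79, 80, 85]
a, b, r = fit_line(xs, ys)
print(round(a, 2), round(b, 2), round(r, 2))  # 26.8 0.64 0.98
```

By construction the fitted line passes through the point of means, which is a handy sanity check on any hand calculation of a.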