These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!
Welcome to Unit 2. Today we talk about the most important thing any statistics person does. I want to quiz you on what it is, knowing that you can't know it yet. But I want to ask you first before I tell you. What is the most important thing a statistics person does? Look at data? Program computers? Run statistics of the type we'll discuss in the future? Or eat tons of pizza? Just check one of those boxes.
Now I can tell you I'm tempted to check "eat pizza," but that isn't the most important thing a statistician does. The most important thing is to look at the data. I can't overemphasize how important this is. A great statistician differs from an okay statistician by the ability to spend a lot of time just looking at data.
Here is another data set. We have the sizes in square feet listed on the left, and of course the prices in dollars listed on the right. I want you to look at the data. In particular, I want you to tell me: is there a fixed dollar amount per square foot? Yes or no?
This data set makes it easy. The answer is no. There are two houses of the same size that sold at different prices, so it cannot be that there is a fixed dollar amount per square foot.
How about I change the house over here from 1400 square feet to 1300 square feet? What is the answer now?
It turns out now the answer is yes, and as I'm sure you verified, the cost per square foot is 70 dollars.
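The check for a fixed price per square foot can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the function name `fixed_price_per_sqft` is made up here, the three houses are the ones whose sizes and prices are quoted later in this unit, and the fourth house in `mixed` is a hypothetical counterexample.

```python
from typing import List, Optional, Tuple

def fixed_price_per_sqft(houses: List[Tuple[int, int]]) -> Optional[float]:
    """Return the common price per square foot, or None if it is not fixed."""
    size0, price0 = houses[0]
    # Cross-multiplying compares the ratios price/size without rounding error.
    if all(price * size0 == price0 * size for size, price in houses):
        return price0 / size0
    return None

# (size in square feet, price in dollars), as quoted in this unit.
houses = [(1400, 98_000), (2400, 168_000), (1800, 126_000)]
print(fixed_price_per_sqft(houses))  # 70.0

# Hypothetical: a second 1400 sq ft house that sold at a different price.
mixed = houses + [(1400, 112_000)]
print(fixed_price_per_sqft(mixed))   # None
```

Two houses of the same size with different prices can never share one ratio, which is why the second call returns `None`.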
Obviously, the data carries a lot of information. I'll now teach you how to visualize the data using a simple trick called a scatter plot. For that, I want you to take a piece of paper and a pen or a pencil and arrange the data in a graph where the x-axis is the size and the y-axis is the price. In a scatter plot, each data item becomes a dot. If we graph different sizes horizontally and prices vertically, the very first data item can be graphed as follows. You draw a vertical line at size 1400. Every house on this dotted line is of the same size: 1400. You draw a horizontal line at the price of $98,000, just below $100,000. Where these two lines meet, you have your very first data point. This dot corresponds to the very first house in your list of data. Let's do the same for the second house.
And again, the size is 2400. The price is $168,000. This is the right answer.
Let's do this a third time for the third house on the list. Which of the boxes best represents the third house?
For a house of size 1800 square feet we pay about $126,000. This is the dot over here. I think you get the story. In a scatter plot, each data item becomes a point. We conveniently chose a 2-dimensional list to make for a 2-dimensional scatter plot. These are the most popular scatter plots because surfaces like paper are 2-dimensional. It's really hard to do it in, like, 125 dimensions. Be that as it may, when I draw in all six data points, I get a scatter plot like this. That's a nice scatter plot, because I can draw a straight line through all of the data points. Now when this happens and there's a relationship that's governed by a straight line, we call the data linear. Now, exact linearity is rare in statistics. Very often you'll find deviations, and that's because the size of the house is not the only determinant of the cost, or perhaps because most of us are bad negotiators. But when a data set is linear, it's really easy to predict prices of houses in between. For example, a house of this size ought to cost this much. You can read out of a scatter plot the dependence of the price of a house on its size. Here we are: we're actually looking at the data. We're doing what a statistician ought to do.
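As a rough sketch of what "linear" and "predicting in between" mean, the Python below tests whether all points lie on the line through the first two, then reads a price off that line. The helper names `is_linear` and `predict_price` are made up for this illustration, only the three houses quoted above are used (the other three appear only on screen), and the 2000 square foot house is a hypothetical in-between size.

```python
def is_linear(points):
    """True if every point lies on the line through the first two points.

    Assumes at least two points and that the first two differ in size.
    """
    (x0, y0), (x1, y1) = points[0], points[1]
    # Comparing cross-multiplied slopes avoids floating-point division.
    return all((y - y0) * (x1 - x0) == (y1 - y0) * (x - x0) for x, y in points)

def predict_price(points, size):
    """Read a price off the line through the first two points."""
    (x0, y0), (x1, y1) = points[0], points[1]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (size - x0)

# (size in square feet, price in dollars), as quoted in this unit.
houses = [(1400, 98_000), (2400, 168_000), (1800, 126_000)]
print(is_linear(houses))            # True
print(predict_price(houses, 2000))  # 140000.0 (hypothetical house size)
```

The prediction is only meaningful when `is_linear` holds; for data with deviations, the line through two arbitrary points is not a sensible summary.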
I'll now ask you to make your own scatter plot. Please graph this new data, and please really draw it on a piece of paper. Then tell me, based on the graph, whether the relationship between price and size is linear. That is, can you draw a line through all the scatter plot data on your diagram? Is this linear? Yes or no?
To answer this, let me make the scatter plot. Here is my x-axis, which measures size. Here is my y-axis, which depicts price. If I graph the different sizes, I get those 6 data points over here. If I graph the different prices, I get the lines over here. Sure enough, connecting those things gets me a data set that happens to be exactly linear.
So, just checking: do you believe there’s a fixed price per square foot? Answer yes or no.
And yes, it’s US$30 per square foot.
Let's do this again for different prices. I just changed all the prices of those homes. Let me ask you the same question. Do we believe there is a fixed price per square foot? Check yes or no.
The answer is no. If you look at the first data point and divide the price by the size, you get approximately 31.1765. Compare this to the next data point, which gets you 30.9524. They're about the same but not quite. In fact, they're all different for the different data points, as you can see over here.
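The two ratios can be reproduced with a short Python sketch. It assumes the first two data points are the 1700 square foot house at $53,000 and the 2100 square foot house at $65,000, the figures quoted later in this unit; the function name `price_per_sqft` is made up for the illustration.

```python
def price_per_sqft(size, price):
    """Price divided by size, in dollars per square foot."""
    return price / size

# (size in square feet, price in dollars); assumed to be the first two rows.
houses = [(1700, 53_000), (2100, 65_000)]
for size, price in houses:
    print(f"{size} sq ft: {price_per_sqft(size, price):.4f} dollars per sq ft")
# 1700 sq ft: 31.1765 dollars per sq ft
# 2100 sq ft: 30.9524 dollars per sq ft
```

Because the two ratios differ, no single price per square foot explains both houses.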
Let's make a scatter plot. Please take a piece of paper and graph the price against the size, as you are now used to in scatter plots. My question for you is: when you graph this, is the data linear? That is, can you fit a line through the scatter plot data? Yes or no? Please graph it very carefully and then make your judgment. It's a nontrivial question.
Amazingly, the answer is yes. Let me graph the data: first data point, second, third, fourth, fifth, and sixth. While my hand drawing isn't perfect, if you're careful you'll find that the points do lie on a straight line.
That's amazing: even though there is not a fixed dollar price per square foot, the relationship is linear. Here is a really challenging question. In this data, the price is a fixed dollar amount per square foot times the size, plus or minus a constant dollar amount. Can you fill in those values? I should say this is a nontrivial question, and I don't expect you to get this right, but it's a wonderful, wonderful preview of what I'll teach you to do in this class very automatically using really amazing pieces of statistics.
The answer happens to be that the square foot costs $30, but there is a constant cost added of $2000. If you plug this in, you'll find, for example, that 1700 times 30 gives $51,000, but if you add the $2000, you'll correctly get $53,000. Do it again with 2100 square feet over here: multiply by $30 and add your $2000. You get $65,000, and so on. Now, again, arriving at those numbers is nontrivial. I wouldn't be surprised if you weren't able to complete this exercise. But if you did, you're quite amazing, because that is the relationship between the price and the size in my example.
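One way to arrive at these two numbers, sketched in Python, is to take any two data points and solve for the slope and the intercept of the line through them. This two-point shortcut is an assumption of the sketch, not the method the class teaches later, and it only works here because the data is exactly linear; the function name `line_through` is made up for the illustration.

```python
def line_through(p, q):
    """Slope and intercept of the line through two (size, price) points."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)   # dollars per square foot
    intercept = y0 - slope * x0     # constant dollar amount
    return slope, intercept

slope, intercept = line_through((1700, 53_000), (2100, 65_000))
print(slope, intercept)             # 30.0 2000.0

# Plugging a size back in reproduces the quoted price.
print(slope * 1700 + intercept)     # 53000.0
```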
In a final exercise, I'd like you to make a scatter plot of the data after I modified two of those prices. Please plot the data again and answer for me the question of whether you can fit a line, or put differently, "is this linear?" I really want you to take a piece of paper and draw those points.
The answer happens to be no. If I draw the original blue data points, the data is indeed linear, but this data point over here, a house of size 2100 square feet, sold remarkably cheap, so we get the point right over here. Similarly, the house of size 1300 square feet sold for a lot. Both of these data points are outliers. We'll talk about outliers later. There is clearly no way to fit a linear function through all those data points. I would even question whether this wicked kind of function that I drew is a good, accurate representation of the relationship between size and price. You would not expect that as you go south of 1500 square feet the price would rapidly increase and then decrease again, because there is likely nothing special about small houses that justifies the price over here.
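To see how far such points sit from the line, one can subtract the price the line predicts from the price actually paid. The Python sketch below assumes the unmodified data follows the line from the previous exercise ($30 per square foot plus $2000). The modified prices are not given in these notes, so the $60,000 and $45,000 figures are hypothetical stand-ins, and the function name `residuals` is made up for the illustration.

```python
def residuals(houses, slope=30, intercept=2_000):
    """Actual price minus the price predicted by the line, per house."""
    return [(size, price - (slope * size + intercept)) for size, price in houses]

# (size in square feet, price in dollars).
# The 1300 and 2100 sq ft prices are HYPOTHETICAL; 1700 sq ft is unmodified.
houses = [(1300, 60_000), (1700, 53_000), (2100, 45_000)]
for size, gap in residuals(houses):
    print(size, gap)
# 1300 19000
# 1700 0
# 2100 -20000
```

A gap of zero means the house sits exactly on the line; the large positive and negative gaps are what make the other two points stand out.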
You finished Unit #2, and you now know a lot about scatter plots. They tend to be 2-dimensional, and real-world scatter plots often look just like this, where simple eyeballing tells us something about the relationship of one variable to another. Now, scatter plots aren't great when there is what is called "noise." That is, the data deviates from the expectation in some random, noisy way. In the next unit, we talk about a simple plot called a "bar chart" that addresses the issue of noise in data by pooling data points into a single cumulative bar. So stay tuned.