In Unit #3, we'll talk about bar charts, which are a common statistical data visualization tool. Let's look at our housing data again. This time I'll order the houses by increasing size. Here are the associated house prices, from $88,000 all the way to $98,000, for these different sizes. Just as a warm-up, I'll ask you a quiz that really belongs to the last unit: Is this data linear?
Without plotting it, I suggest the answer is no. The size goes up monotonically, but the cost jumps up and down. You won't find a way to draw a straight line through the data when the size is increasing but the cost is bumping up and down. I'll leave the drawing of a scatter plot to you as an exercise, but it looks about like that.
If I now ask you how much to pay for a 2200 square foot house, and you use the interpolation method that we learned in the very first class, where we looked at the two nearest data items and interpolated linearly, what would you get? In a minute I'll ask you whether you trust that number.
What you get is the price halfway between those of the 2100 and the 2300 square foot houses. That is, $105,000.
Let me ask you, do you have trust in this $105,000 number? I really want you to say no, so please take your vote.
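As a rough check of that halfway value, here is a minimal Python sketch of linear interpolation between the two nearest data points. It assumes the 2300 square foot house is the $98,000 one (the last price in the sorted list) and that the 2100 square foot house costs $112,000. That second figure is not stated in these notes; it is inferred from the $105,000 midpoint. The helper name `interpolate` is made up for this illustration.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    fraction = (x - x0) / (x1 - x0)  # how far x lies between x0 and x1
    return y0 + fraction * (y1 - y0)


# Assumed nearest neighbours of the 2200 sq ft query
# (the 112,000 price is inferred, not quoted from the lecture).
estimate = interpolate(2200, 2100, 112_000, 2300, 98_000)
print(estimate)  # 105000.0
```

Because 2200 sits exactly in the middle of 2100 and 2300, the interpolated price is simply the average of the two neighbouring prices.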
And indeed, I don't have trust in this. The reason is that, starting from the 2100 square foot home, which is clearly smaller than this one, the price has decreased. It has gone down. Do we believe that in general, between a 2100 square foot home and a 2300 square foot home, the price should go down? Do we actually believe that in a scatter plot like this the functional relationship between size and cost goes like this? Or do we instead believe it should go like this?
The deviations from that linear graph are what's called "noise." Now, noise might not be the best term, but it's the term that statisticians use. It might be that one of them has a great view. The next one is an old house. This one has coastal access, which makes it more expensive. This one really requires a new kitchen. There might be factors that really affect the house price beyond the size. But if those factors are not included, to a statistician that's called "random noise."
But coming back to my original question, I think we don't believe the red curve. Let's talk about bar charts as one way to alleviate the problem.
In a bar chart, we take our raw data and pool it together. For example, we might say all the data that falls into this interval over here should be summarized by a single value. Such a value would lie halfway between these two data points and form what's called a bar. Similarly, we might pool together this data over here into a single bar, and so on. Let me ask you a question. What is the height, in terms of the dollar figure, of the very first bar?
The answer is $80,000. It's the halfway point between $88,000 and $72,000.
Let me just redo this for the second bar. Please put your number right here.
And the answer is $90,000, which is just the halfway point between $94,000 and $86,000, the two data points that fall under the second bar.
I'm sure you get it by now, so give me the number for the third bar.
These are the two points that fall into the third bar. The mean value here is $105,000.
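The three bar heights from these quizzes can be recomputed with a few lines of Python. This is only a sketch: it assumes each bar pools two neighbouring houses in size order, and the $112,000 price is inferred from the third bar's mean of $105,000 together with the final price of $98,000, since the notes do not quote it directly.

```python
# Prices in order of increasing house size.
# 112,000 is an inferred value (see the note above), not quoted from the lecture.
prices = [88_000, 72_000, 94_000, 86_000, 112_000, 98_000]

group_size = 2  # assumed: each bar summarizes two neighbouring data points

# Height of each bar = mean price of the data points pooled into it.
bar_heights = []
for start in range(0, len(prices), group_size):
    group = prices[start:start + group_size]
    bar_heights.append(sum(group) / len(group))

print(bar_heights)  # [80000.0, 90000.0, 105000.0]
```

The individual prices jump up and down, but the pooled bar heights rise from one bar to the next.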
If you look at the bar graph, what you'll find is that it's a much clearer representation of the data. By pooling together multiple data points into an individual bar, you can see that there is a much better way to really understand the dependence of cost on size. While the bar chart doesn't give you a linear relationship (in fact, in this case it happens to be nonlinear), it really gives you a sense that as you go up in house size the cost increases, which wasn't obvious from looking at the individual data points. What the bar chart does is help you pool together groups of data into a single bar and understand global trends.
Such global trends might not be that important if you only have six data points, but imagine you have 60,000. With 60,000 data points, your scatter plot may look like this. I can tell you my hand isn't really able to draw 60,000 points. If you look at this data set, the individual data points tell you very little: moving the x value by a tiny bit might make y jump from here to here, down to here, to here, and at some point down to here. Yet a bar graph can really help you understand the data. Clearly, one of the things that a statistician does is use aggregating tools, such as bar graphs, to gain an understanding of the underlying data.
Let me ask you. Are bar charts cool? Just check one of the two answers.
Honestly, if you checked no, perhaps this class isn't for you. Perhaps you don't share my excitement about looking at data and using simple tools like bar charts to really understand what's going on. But if you checked yes, you're on your way to becoming a statistician.
I want to talk briefly about histograms as a special case of a bar chart. The key difference is that whereas the bar charts we discussed so far were defined over 2D data, histograms look at 1D data. That is, there is only one dimension of data being plotted. Let me start with an example. Here is a fictitious data set about annual income. Suppose at some company I asked software engineers how much annual income they make. Again, this data set is contrived. Of the nine people I asked, here is the survey of their annual salaries. In the histogram case, I make a bar chart that concerns itself with only one thing, which is called "frequency," which here simply means "count." It will group these salaries into three different buckets: from $120,000 on, $130,000 on, and $140,000 on. What the bar chart plots is the frequency with which the people asked fall into the different categories. Specifically, I am asking you: what is the count of the salaries that fall into the $120,000 to $130,000 bucket? Please answer it here.
The answer is 5, because all the salaries marked here fall between $120,000 and $130,000. So the bar in the histogram for this interval would be 5 high.
Give me the same number for the next interval.
Yes, the answer is 2. There are exactly 2 elements that fall between $130,000 and $140,000. Obviously the final bar is of size 2 again, because there are exactly 2 elements left. Now, this histogram differs from the bar chart in that the vertical axis is just a frequency count, whereas before it might have been a median home sales value. In 1-dimensional data sets that are numerical, this can be informative. You can say, for example, that the majority of workers in this company are in this salary bracket, whereas a much smaller number are in the higher salary brackets.
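To make the counting step concrete, here is a small Python sketch of a histogram over salary buckets. The nine salaries below are hypothetical, chosen only so that the counts come out as 5, 2, and 2 as in the quiz; the lecture's actual values are not part of these notes. It also assumes each bucket includes its lower edge and excludes its upper edge.

```python
from collections import Counter

# Hypothetical salaries (not the lecture's actual data).
salaries = [
    121_000, 123_500, 125_000, 127_000, 129_900,  # $120,000 bucket
    132_000, 138_000,                             # $130,000 bucket
    141_000, 146_500,                             # $140,000 bucket
]

bucket_width = 10_000

# Map each salary to the lower edge of its bucket, e.g. 127,000 -> 120,000,
# then count how many salaries land in each bucket.
counts = Counter((s // bucket_width) * bucket_width for s in salaries)

for lower in sorted(counts):
    print(f"${lower:,} to ${lower + bucket_width:,}: {counts[lower]}")
```

Only one column of data goes in, and the bar height is nothing more than the number of values that land in each bucket.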
A famous histogram can be obtained by looking at the age distribution. For the USA, the distribution looks about as follows. Again, going into the full statistics would involve an endless number of actual ages and so on. Let's make a histogram that is somewhat simplified and only looks at people from age 0 to 40. Here is the data set: 21, 17, 9, 27, and so on. I'm asking you, for a histogram with these four bars, how high the bar is for each of the four ranges shown over here. Again, the horizontal axis depicts age, and the vertical axis the count. Please enter your answers here.
From 0 to 10, we have 5 individuals. From 11 to 20, it's 7. From 21 to 30, it's 1, 2, 3, 4, 5. And finally, from 31 to 40, it is the remaining 5 data points. In this example, our chart would look like this. This is a histogram that depicts the count in this data set as a function of the age range.
In this unit you learned about bar charts and histograms. They both use vertical bars, and they both aggregate data. The big difference is that the bar chart is defined over 2D data, where one dimension goes on the x axis and the other on the y axis, whereas histograms only apply to 1D data, where the y axis becomes the count of that data. In the next unit we'll encounter another plot. Without giving it away, it is somehow related to this birthday pie. So stay tuned.
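The same tally can be sketched in Python with explicit, inclusive age ranges. Only the first four ages (21, 17, 9, 27) appear in these notes; the remaining values are hypothetical, picked so that the bars come out as 5, 7, 5, and 5 as in the answer above.

```python
# First four ages are from the notes; the rest are hypothetical fill-ins.
ages = [21, 17, 9, 27, 3, 35, 14, 40, 0, 19, 30,
        11, 6, 33, 18, 24, 10, 38, 12, 22, 31, 20]

# Inclusive age ranges, matching the four bars in the quiz.
ranges = [(0, 10), (11, 20), (21, 30), (31, 40)]

# Bar height = number of ages that fall inside each range.
bar_heights = {
    (low, high): sum(low <= age <= high for age in ages)
    for low, high in ranges
}

# Print a simple text histogram, one row per range.
for (low, high), count in bar_heights.items():
    print(f"{low:>2} to {high}: {'#' * count} ({count})")
```

Writing the ranges with inclusive bounds mirrors how the quiz states them (0 to 10, 11 to 20, and so on), so every whole-number age falls into exactly one bar.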