These are draft notes extracted from subtitles.
In Unit #3, we'll talk about bar charts, which are a common statistical data visualization tool. Let's look at our housing data again. This time I'll order the houses by increasing size. Here are the associated house prices, from $88,000 all the way to $98,000, for these different sizes. Just as a warm-up, I'll ask you a quiz that really belongs to the last unit. Is this data linear?
Without plotting it, I suggest the answer is no. The size goes up monotonically, but the cost jumps up and down. You won't find a way to draw a straight line through the data when the size is increasing but the cost is bumping up and down. I'll leave the drawing of a scatter plot to you as an exercise, but it looks about like that.
Suppose I now ask you how much to pay for a 2,200 square foot house, and you use the interpolation method that we learned in the very first class, where we looked at the two nearest data items and interpolated linearly. What would you get? In a minute I'll ask you whether you trust that number.
What you get is the halfway point between the prices of the 2,100 and 2,300 square foot homes. That is, $105,000.
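As a rough sketch of that interpolation step in Python: the function name is made up for illustration, and the two prices are assumptions. The $98,000 figure is the last price in the list, and $112,000 is inferred from the $105,000 midpoint rather than stated in these notes.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    t = (x - x0) / (x1 - x0)  # how far x lies between x0 and x1, from 0 to 1
    return y0 + t * (y1 - y0)

# Assumed neighbors: 2,100 sq ft at $112,000 (inferred) and 2,300 sq ft at $98,000.
estimate = interpolate(2200, 2100, 112_000, 2300, 98_000)
print(estimate)
```

Because 2,200 sits exactly halfway between the two sizes, the estimate lands exactly halfway between the two prices.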
Let me ask you, do you have trust in this $105,000 number? I really want you to say no, so please cast your vote.
And, yes, I don't have trust in this. The reason is that from a 2,100 square foot home, which is clearly smaller than this one, the price has decreased. It has gone down. Do we believe that in general between a 2,100 square foot home and a 2,300 square foot home the price should go down? Do we actually believe that in a scatter plot like this the functional relationship between size and cost goes like this? Or do we instead believe it should go like this? The deviations from that linear graph are what's called "noise". Now, noise might not be the best term, but it's the term that statisticians use. It might be that one of them has a great view. The next one is an old house. This one has coastal access, which makes it more expensive. This one really requires a new kitchen. There might be factors that really affect the house price beyond the size. But if those factors aren't included, to a statistician that's called "random noise". But coming back to my original question, I think we don't believe the red curve. Let's talk about bar charts as one way to alleviate the problem.
In a bar chart, we take our raw data and pool it together. For example, we might say all the data that falls into this interval over here should be summarized by a single value. Such a value would lie halfway in between these two data points and form what's called a bar. Similarly, we might pool together this data over here into a single bar and so on. Let me ask you a question. What is the height in terms of the dollar figure for the very first bar?
The answer is 80,000. It's the halfway point between 88,000 and 72,000, which is 80,000.
Let me just redo this for the second bar. Please put your number right here.
And the answer is 90,000, which is just the halfway point between 94,000 and 86,000, and these are the two data points that fall under the second bar.
I'm sure you get it by now, so give me the number for the third bar.
These are the two points that fall into the third bar. The mean value here is 105,000.
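The three quiz answers can be reproduced with a small sketch. It assumes the six prices are listed in order of increasing house size and that each bar pools two neighboring houses, as in the quizzes. The helper name `bar_heights` is made up for illustration, and the $112,000 price is inferred from the $105,000 mean rather than stated in these notes.

```python
from statistics import mean

# Prices in order of increasing house size, as quoted in the quizzes.
# 112_000 is an inferred value, not one stated in the notes.
prices = [88_000, 72_000, 94_000, 86_000, 112_000, 98_000]

def bar_heights(values, per_bar):
    """Average consecutive groups of `per_bar` values into one bar height each."""
    return [mean(values[i:i + per_bar]) for i in range(0, len(values), per_bar)]

heights = bar_heights(prices, 2)
print(heights)
```

The three heights come out as 80,000, 90,000, and 105,000, matching the quiz answers. With only two points per bar, the mean is the same as the halfway point.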
If you look at the bar graph, what you'll find is it's a much finer representation of the data. By pooling together multiple data points into an individual bar, you can see that there is a much better way to really understand the dependence of cost on size. While the bar chart doesn't give you a linear relationship (in fact, in this case, it happens to be nonlinear), it really gives you a sense that as you go up in house sizes the cost increases, which wasn't obvious from looking at the individual data points. What the bar chart does is it really helps you to pool together groups of data into a single bar and understand global trends. Such global trends might not be that important if you only have six data points, but imagine you have 60,000. With 60,000 data points, your scatter plot may look like this. I can tell you my hand isn't really able to draw 60,000 points. If you look at this data set, the individual data points tell you very little. Moving along the x axis by a tiny bit might make y jump from here to here, down to here, to here, and at some point down to here. Yet a bar graph can really help you understand the data. Clearly, one of the things that a statistician does is to use cumulative tools, such as bar graphs, to gain an understanding of the underlying data. Let me ask you. Are bar charts cool? Just check one of the two answers.
Honestly, if you checked no, perhaps this class isn't for you. Perhaps you don't share my excitement about looking at data and using simple tools like bar charts to really understand what's going on. But if you checked yes, you're on your way to becoming a statistician.
I want to talk briefly about histograms as a special case of a bar chart. The key difference is that whereas the bar charts we discussed so far were defined over 2D data, histograms look at 1D data. That is, there is only one dimension of data that is being plotted. Let me start with an example. Here is a fictitious data set about annual income. Suppose at some company I asked software engineers how much annual income they make. Again, this data set is contrived. Of the nine people I asked, here is the survey of different annual salaries. In the histogram case, I make a bar chart that concerns itself with only one thing, which is called "frequency", which is just another word for "count". We will group these salaries into three different buckets: from $120,000 on, from $130,000 on, and from $140,000 on. What the bar chart plots is the frequency at which the people asked fall into the different categories. Specifically, I am asking you what is the count for the salaries that fall into the $120,000 to $130,000 bucket. Please answer it here.
The answer is 5, because all the salaries marked here fall between $120,000 and $130,000. So the bar in the histogram plot for this interval would be 5 high.
Give me the same number for the next interval.
Yes, the answer is 2. There are exactly 2 elements that fall between $130,000 and $140,000. Obviously the final bar is of size 2 again, because there are exactly 2 elements. Now, this histogram differs from the bar chart in that the vertical axis is just a frequency count, whereas before it might have been a median home sales value. In 1-dimensional data sets that are numerical, this could be informative. You can say, for example, the majority of workers in this company are in this salary bucket, whereas a much smaller number are in higher salary buckets.
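Here is a minimal sketch of how such frequency counts could be computed. The nine salary values are hypothetical: these notes don't list the actual survey numbers, so the values below are invented only so that the buckets come out as 5, 2, and 2. The function name and its parameters are also just for illustration.

```python
from collections import Counter

# Hypothetical salaries, invented to match the counts 5, 2, 2 from the quiz.
salaries = [121_000, 123_000, 125_000, 127_000, 129_000,
            132_000, 138_000, 141_000, 147_000]

def histogram(values, start, width):
    """Count the values in each bucket [start + k*width, start + (k+1)*width)."""
    # Integer division maps each value to the index k of its bucket.
    counts = Counter((v - start) // width for v in values)
    # Key each count by the lower edge of its bucket.
    return {start + k * width: counts[k] for k in sorted(counts)}

counts = histogram(salaries, 120_000, 10_000)
print(counts)
```

Each key is the lower edge of a bucket, so the entry for 120,000 is the height of the first bar. In this sketch a salary of exactly $130,000 would land in the second bucket, which is one reasonable reading of "from $130,000 on".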
A famous histogram can be obtained by looking at the age distribution. For the USA, the distribution looks about as follows. Again, the real statistics contain an endless number of actual ages and so on. Let's make a histogram that is somewhat simplified and only looks at people from age 0 to 40. Here is the data set: 21, 17, 9, 27, and so on. I'm asking you, for a histogram of this data, how high are the bars for each of the four ranges shown over here? Again, the horizontal axis depicts age, and the vertical axis the count. Please enter your answers here.
From 0 to 10, we have 5 individuals. From 11 to 20 it's 7. From 21 to 30, it's 1, 2, 3, 4, 5. And finally from 31 to 40, it is the remaining 5 data points. In this example, our chart would look like this. This is a histogram that depicts the count in this data set as a function of the range.
In this unit you learned about bar charts and histograms. They both use vertical bars, and they both aggregate data. The big difference was that the bar chart is defined over 2D data. One dimension applies to the x axis and the other to the y axis, whereas histograms only apply to 1D data, where the y axis becomes the count of that data. In the next unit we'll encounter another plot. Without giving it away, it is somehow related to this birthday pie. So stay tuned.
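Counting by labeled age ranges could be sketched like this. Only the four ages quoted above (21, 17, 9, 27) are used, because the full data set is not listed in these notes, so the result is not the 5, 7, 5, 5 from the quiz. The names `RANGES` and `count_by_range` are made up for illustration.

```python
# The four age ranges from the quiz, inclusive on both ends.
RANGES = [(0, 10), (11, 20), (21, 30), (31, 40)]

def count_by_range(ages, ranges=RANGES):
    """Return one count per (low, high) range; ages outside every range are ignored."""
    return [sum(low <= age <= high for age in ages) for low, high in ranges]

# Only the four ages quoted in the notes; the rest of the data set is not given.
sample = [21, 17, 9, 27]
counts = count_by_range(sample)
print(counts)
```

For these four ages, one falls in 0 to 10, one in 11 to 20, two in 21 to 30, and none in 31 to 40. Feeding in the full list from the slide would give the four bar heights from the quiz.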