These are draft notes from subtitles. Please help to improve them. Thank you!
Welcome to Statistics 101, taught by me, Sebastian Thrun, and by Adam Sherwin, who is our assistant instructor doing a lot of work in the background. I figure I'll start with giving you a teaser, a challenging teaser. I'm going to provoke you. So, this is you. I believe you should be unhappy, not because our class is so bad, but you and I will prove in a second that you are unpopular. The reason why I show this is to show you how deep statistics is and how we can easily fool ourselves. Let's dive in.
Let's say there are two types of people for simplicity--type A and type B. Type A are the popular ones. They have 80 friends. And type B are less popular. They only have 20 friends. You might now say that I don't know which type you are. I will compute what's called the expected or average number of friends. To do so, I assume that half of the people are of type A and half of the people are of type B. Here's your very first quiz in this class. Into this box enter what you think is the expected or average number of friends if you have a 50% chance of being a type A and a 50% chance of being a type B. Type in here when you're done. There is a submit button somewhere down here. You can see if your answer is correct. Regardless of whether you finish this or not, at some point just hit the next button, and you'll see my answer.
My answer is 50 friends. The way I get there is I don't know what type you are with a 50% chance or 1/2 you're type A, in which case you have 80 friends, and a 50% chance you're type B and have 20 friends. That gives me this equation over here as the expected number of friends. Working this out means you have 40 + 10 = 50 friends. So far, so good, but why are you unpopular?
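As a minimal sketch (not part of the lecture), the expectation above can be written out in Python. The names `types` and `expected_value` are illustrative; the 50/50 split and the friend counts are the ones assumed in the quiz.

```python
# Each entry is (probability of being that type, number of friends for that type).
# The 50/50 split is the assumption stated in the quiz.
types = [(0.5, 80), (0.5, 20)]


def expected_value(outcomes):
    """Return the sum of probability * value over all outcomes."""
    return sum(probability * value for probability, value in outcomes)


expected_friends = expected_value(types)
print(expected_friends)
```

The same helper works for any list of (probability, value) pairs whose probabilities sum to 1.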
Here is your Facebook or G+ page, and of course you're smiling. On it is your list of friends, and we already know it's either 80 or 20 friends. In expectation it's 50 friends. Let's pick a random one of your friends, like this one. This person will also have a Facebook or a G+ page. Before I raise the question how many friends this person has, let's consider that this might be either a type A or a type B person. Keep in mind that type A have 80 friends and type B have 20 friends. The question I have for you is, what are the chances you picked a type A friend? This should be a number between 0 and 1. I'm also going to ask you for the opposite. What are the chances you picked a type B? Please enter both numbers, and these numbers should sum up to 1. Submit and then next. I should warn you this is a challenging question. If you don't get this right, don't worry. This is the type of stuff you'll know when you've taken the class. I just want to tease you a little bit in the beginning.
Here's the interesting finding. Because type As are so much more popular, your chance of linking to a type A is 0.8. To see this, let's take the extreme view. Suppose type Bs had 0 friends. Then you'd never link to a type B. You link to type A and to type B in a proportion of 80 to 20. Type B would be 0.2. That means most of the friends you link to are type A. They're the type of people that happen to be popular.
Let's now go back and ask the real question. In expectation, how many friends does this friend of yours have? Please put your number right here, and again, it's a challenging question. Don't get disturbed if you don't know the answer. This type of stuff we'll study in and out.
Here is my surprising answer. It is 68 friends. Who would have thought? The way to get there is with 0.8 chance you'll pick a type A who has 80 friends and with 0.2 chance you'll pick a type B who has 20 friends. If you work this out, this is 64 + 4, which makes 68. Your friend in expectation would have 68 friends where you would only have 50 in expectation. So, sorry, I think you are unpopular in expectation.
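The reasoning in the last two answers can be sketched in Python. This is an illustration under the lecture's assumptions (half the people are type A with 80 friends, half are type B with 20), and the variable names are made up for the example. The key step is that a random friend is picked in proportion to how many friendship links each type holds.

```python
# Population model from the lecture: share of people and friends per person.
population = {
    "A": {"share": 0.5, "friends": 80},
    "B": {"share": 0.5, "friends": 20},
}

# A type's weight among "friends of someone" is share * friends,
# because each friendship is one chance of being picked.
links = {name: t["share"] * t["friends"] for name, t in population.items()}
total_links = sum(links.values())
pick_probability = {name: links[name] / total_links for name in links}

# Expected number of friends of a randomly picked friend.
expected_friends_of_friend = sum(
    pick_probability[name] * population[name]["friends"] for name in population
)

print(pick_probability)
print(expected_friends_of_friend)
```

Changing the friend counts in `population` shows that the gap between the two expectations disappears only when both types have the same number of friends.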
Let's talk about this class. Most of the material is very basic. It's the first class you would have in college if you're not a statistics major. We teach you things like how to visualize data, how to summarize it, how to run tests, and even find trends. But there are also a few nuggets in there that are challenges. These are optional, and they're clearly marked as optional, but I will let you prove some theorems along the way using little games that I play with you. Most important, you'll be afforded the possibility to program the things you've learned. Again, this is optional, because I don't expect you to have a programming background. But give it a try. I believe that through programming you learn the material much better than any other way. It's optional. You can assimilate all the material without it. But I would recommend giving it a try if you know how to program.
Welcome to my online class on statistics. The basic idea of statistics is that the world is full of data, and we the people have to make decisions. Statistics comes to our rescue. It takes data and turns it into information that we the people can use to make decisions. Whether you are in social sciences, medicine, engineering, public policy, psychology, climatology, robotics, even archaeology, health sciences, finance, business and marketing, or pretty much any other discipline that you can study, all of those are now driven by data, including unlikely fields like biology or physics and so many others. Statistics is an amazing discipline to know. It is universal, useful, and fun, as I hope you're going to see in the class I'm just about to teach.
One of the standard problems that people study in statistics has to do with purchasing decisions. Suppose you wish to buy a house. There are small houses and big houses, but you really like this one special house built by a famous designer. This house has a certain price. Say in US dollars it's $92,000. The question you'd like to ask yourself--is this okay? Is it too much--you should pay less--or too little? Let's go and find out. In statistics, the way we find out is by looking at data. Let's assume there is a database of previous sales of homes in the same neighborhood. Just for simplicity, let's assume we know about two things--the size of the home and the price at which it was sold. There is a house with 1400 square feet that sold at $112,000, a much larger one with 2400 square feet sold for $192,000, and so on for a whole number of other houses. Now it's a statistics problem. You have past data, and here is your very first quiz. Say the house you wish to purchase is 1300 square feet in size. How much money should you expect to pay?
Well, in our very first quiz we're just going to look it up. It turns out there was another house sold at the same size, and it brought in $104,000. In the interest of statistics, the answer is $104,000. This is not the game theory class. Obviously, you wouldn't want to bid that much, but in the interest of statistics, that's what you'd expect to pay.
Same question now with 1800 square feet.
And yes, the answer is $144,000.
A more tricky question--what about if the house you're trying to purchase has 2100 square feet?
That one is tricky. I'm going to answer it the following way-- If you take halfway between $144,000 and $192,000, we get the mean of 144 and 192 thousand, and that is $168,000.
Now, that isn't always correct. I assume that this data has certain properties, which I'll talk about later, but let's move on and assume we can use the trick of finding prices just in between existing prices to price other sizes of houses like 1500 square feet. Please put your answer right here.
Yes, the answer is $120,000. By our logic, 1500 lies between 1400 and 1800. In fact, it's a quarter of the way from 1400 to 1800. We'd say it's $112,000 plus 1/4 of the difference between the price of an 1800 square foot home and a 1400 square foot home. That difference is $32,000, and adding a quarter of it, $8,000, gets us from $112,000 to just $120,000.
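The "find the price in between" trick is linear interpolation between the two nearest known sizes. Here is a small Python sketch of it, using only the four sales quoted so far. The function name `interpolate_price` is hypothetical, and the sketch assumes the requested size lies inside the range of known sizes.

```python
# Past sales quoted in the lecture: size in square feet -> sale price in dollars.
sales = {1300: 104_000, 1400: 112_000, 1800: 144_000, 2400: 192_000}


def interpolate_price(size, sales):
    """Look the size up directly, or interpolate linearly between its neighbours.

    Assumes min(sales) <= size <= max(sales); sizes outside that range
    raise a ValueError because there is no neighbour on one side.
    """
    if size in sales:
        return sales[size]
    lower = max(s for s in sales if s < size)   # nearest smaller known size
    upper = min(s for s in sales if s > size)   # nearest larger known size
    fraction = (size - lower) / (upper - lower)  # how far along the interval
    return sales[lower] + fraction * (sales[upper] - sales[lower])


print(interpolate_price(1500, sales))
print(interpolate_price(2100, sales))
```

For 1500 square feet the fraction is 1/4, and for 2100 square feet it is 1/2, matching the two quiz answers.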
I guess by now you've probably figured it out, but let me ask you just to make sure you understand the logic behind this specific data set. What is the cost of the home per square foot? Here's my square foot, and please answer in the box over here.
The answer is $80 per square foot, and we get this by just dividing $112,000 by 1400. It turns out that this data set has this amazing property that the cost per square foot is constant. That allows us to interpolate the way we just did. In statistics that's often not the case, but I want to congratulate you. You did your very first unit of statistics. Congratulations. You completed Unit 1. But as we go forward, we're going to look into data where the cost might not just be a constant factor times the size of a home.
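To check the constant-cost property for every quoted sale rather than just one, a short Python sketch can divide each price by its size. The tolerance of `1e-9` is an arbitrary choice for the example.

```python
# Past sales quoted in the lecture: size in square feet -> sale price in dollars.
sales = {1300: 104_000, 1400: 112_000, 1800: 144_000, 2400: 192_000}

# Dollars per square foot for each sale.
price_per_sqft = {size: price / size for size, price in sales.items()}

# The cost per square foot is "constant" if all the ratios agree.
ratios = list(price_per_sqft.values())
is_constant = max(ratios) - min(ratios) < 1e-9

print(price_per_sqft)
print(is_constant)
```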
Welcome to Unit 2. Today we talk about the most important aspect of any statistics person. I want to quiz you on what it is, knowing that you can't know it. But I want to ask you first before I tell you. What is the most important thing a statistics person does? Look at data? Program computers? Run statistics of the type we'll discuss in the future? Or eat tons of pizza? Just check one of those boxes.
Now I can tell you I'm tempted to check mark "eat pizza," but that isn't the most important thing a statistician does. The most important thing is to look at the data. I can't overemphasize how important this is. A great statistician differs from an okay statistician by the ability to spend a lot of time just looking at data.
Here is another data set. We have the sizes in square feet listed on the left, and of course the prices in dollars listed on the right. I want you to look at the data. In particular, I want you to tell me: is there a fixed dollar amount per square foot? Yes or no?
This data set makes it easy. The answer is no. There are two houses of the same size that sold at different prices, so it cannot be that there is a fixed dollar amount per square foot.
How about I change the house over here from 1400 square feet to 1300 square feet. What is the answer now?
It turns out now the answer is yes, and as I'm sure you verified, the cost per square foot is 70 dollars.
Obviously, the data carries a lot of information. I'll now teach you how to visualize the data, using a simple trick called a scatter plot. For that, I want you to take a piece of paper and a pen or a pencil and arrange the data in a graph where the x axis is the size, and the y axis is the price. In a scatter plot, each data item becomes a dot. If we graph different sizes horizontally and prices vertically, the very first data item can be graphed as follows. You draw a line that's vertical at size 1400. Every house on this dotted line is of the same size--1400. You draw a horizontal line at the price of $98,000, just below $100,000. Where these two lines meet you have your very first data point. This dot corresponds to the very first house in your list of data. Let's do the same for the second house.
And again, the size is 2400. The price is $168,000. This is the right answer.
Let's do this a third time for the third house on the list. Which of the boxes best represents the third house?
For a house of size 1800 square feet we pay about $126,000. This is the dot over here. I think you get the story. In a scatter plot, each data item becomes a point. We conveniently chose a 2-dimensional list to make for a 2-dimensional scatter plot. These are the most popular scatter plots because surfaces like paper are 2-dimensional. It's really hard to do it in like 125 dimensions. Be that as it may, when I draw in all the six data points, I get a scatter plot like this. That's a nice scatter plot, because I can draw a line straight through all of the data points. Now when this happens and there's a relationship that's governed by a straight line, we call the data linear. Now, linearity is a rare concept in statistics. Very often you'll find deviations, and that's because the size of the house is not the only determinant in the cost or perhaps because most of us are bad negotiators. But when a data set is linear, it's really easy to predict prices of houses in between. For example, a house of this size ought to cost this much. You can read out of a scatter plot the dependence of the size of a house and its price. Here we are--we're actually looking at the data. We're doing what a statistician ought to do.
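"Can I draw one straight line through all the points?" can also be checked numerically. The sketch below is an illustration, not the lecture's method: it tests whether every point lies on the line through the first two points. It uses only the three houses whose size and price are quoted above (the other three of the six are not given in these notes), and the function name `is_linear` is made up for the example.

```python
# (size in square feet, price in dollars) for the three houses quoted above.
points = [(1400, 98_000), (2400, 168_000), (1800, 126_000)]


def is_linear(points):
    """Return True if all points lie on the line through the first two points.

    Uses a cross-product test with integer arithmetic, so there is no
    rounding error. Assumes the first two points are distinct.
    """
    (x0, y0), (x1, y1) = points[0], points[1]
    return all(
        (x1 - x0) * (y - y0) == (x - x0) * (y1 - y0)
        for x, y in points[2:]
    )


print(is_linear(points))

# A made-up variation: if the third house had sold for $130,000 instead,
# the three points would no longer fall on one line.
print(is_linear([(1400, 98_000), (2400, 168_000), (1800, 130_000)]))
```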
I'll now ask you to make your own scatter plot. Please graph this new data and tell me based on the graph-- please really draw it on a piece of paper-- whether the function between price and size is linear-- that is, can you draw a line through all the scatter plot data on your diagram? Is this linear? Yes or no?
To answer this, let me make the scatter plot. Here is my x-axis, which measures size. Here is my y-axis, which depicts price. If I graph the different sizes, I get those 6 data points over here. If I graph the different prices, I get the lines over here. Sure enough, connecting those things gets me a data set that happens to be exactly linear.
Just checking, do we believe there is a fixed price per square foot? Answer yes or no.
Yes, it's $30 US per square foot.
Let's do this again for different prices. I just changed all the prices of those homes. Let me ask you the same question. Do we believe there is a fixed price per square foot? Check yes or no.
The answer is no. If you look at the first data point and divide the price by size, you get approximately 31.1765. Compare this to the next data point, which gets you 30.9524. They're about the same but not quite. In fact, they're all different for the different data points, as you can see over here.
Let's make a scatter plot. Please take a piece of paper and graph the size relative to the price as you are now used to in scatter plots. My question for you is, when you graph this, is the data linear? That is, can you fit a line through the scatter plot data? Yes or no? Please graph it very carefully and then make your judgment. It's a non-trivial question.
Amazingly the answer is positive. Let me graph the data--first data point, second, third, fourth, fifth, and sixth. While my hand drawing isn't perfect, if you're careful you'll find out that they're just linear.
That's amazing--even though there is not a fixed dollar price per square foot, the relationship is linear. Here is a really challenging question. In this data, the price is linear in the size plus or minus a constant dollar amount. Can you fill in those values? I should say this is a non-trivial question, and I don't expect you to get this right, but it's a wonderful, wonderful preview of what I'll teach you to do in this class very automatically using really amazing pieces of statistics.
The answer happens to be the square foot costs $30, but there is a constant cost added of $2000. If you plug this in, you'll find, for example, that 1700 times 30 gives $51,000, but if you add the $2000, you'll correctly get $53,000. Do it again with 2100 square feet over here multiplied by $30, add your $2000. You get $65,000 and so on. Now, again, arriving at those numbers is nontrivial. I wouldn't be surprised if you weren't able to complete this exercise. But if you did, you're quite amazing, because that is the relationship between the price and the size in my example.
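One way to arrive at those two numbers, sketched in Python: when the data is exactly linear, any two points determine the line. The sketch uses the two houses quoted above (1700 square feet for $53,000 and 2100 square feet for $65,000). The variable names are illustrative, and this two-point shortcut is only a stand-in for the more general methods the class promises later.

```python
# Two data points quoted in the lecture: (size in square feet, price in dollars).
size_1, price_1 = 1700, 53_000
size_2, price_2 = 2100, 65_000

# Slope: extra dollars per extra square foot.
dollars_per_sqft = (price_2 - price_1) / (size_2 - size_1)

# Intercept: the constant amount left over once the per-square-foot part is removed.
constant_cost = price_1 - dollars_per_sqft * size_1

print(dollars_per_sqft, constant_cost)

# The plain price/size ratios differ from house to house, which is why
# there is no fixed price per square foot even though the data is linear.
print(round(price_1 / size_1, 4), round(price_2 / size_2, 4))
```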
The answer happens to be no. If I draw the original blue data points, the data is indeed linear, but this data point over here of a house of size 2100 square feet sold remarkably cheap, so we get the point right over here. Similarly, the house of size 1300 square feet sold for a lot. Both of these data points are outliers. We'll talk about outliers later. There is clearly no way to fit a linear function through all those data points. I would even question that this wicked kind of function that I drew is a good, accurate representation of the relationship between size and price. You would not expect that as you go south of 1500 square feet the price would rapidly increase and then decrease again, because there is likely nothing special about small houses that justifies the price over here.
You finished Unit #2, and you now know a lot about scatter plots. They tend to be 2-dimensional, and real-world scatter plots often look just like this, where a simple eyeballing tells us something about the relationship of one variable to another. Now, scatter plots aren't great when there is what is called "noise." That is, the data deviates from the expectation in some random, noisy way. In the next unit, we talk about a simple plot called a "bar chart" that addresses the issue of noise in data by pooling data points into a single cumulative bar. So stay tuned.
In Unit #3, we'll talk about bar charts, which are a common statistical data visualization tool. Let's look at our housing data again. This time I'll order the houses by increasing size. Here are the associated house prices from $88,000 all the way to $98,000 for these different sizes. Just as a warmup, I'll ask you a quiz that really belongs to the last unit. Is this data linear?
Without plotting it, I suggest the answer is no. The size goes up monotonically, but the cost jumps up and down. You won't find a way to draw a straight line through the data when the size is increasing but the cost is bumping up and down. I'll leave the drawing of a scatter plot to you as an exercise, but it looks about like that.
If I now ask you how much to pay for a 2200 square foot house, and you use the interpolation method that we learned in the very first class where we looked at the two nearest data items and interpolated linearly, what would you get? In a minute I'll ask you, do you have trust in that number?
What you get is the halfway point between 2100 and 2300. That is, 105,000.
Let me ask you, do you have trust in this $105,000 number? I really want you to say, no, so please take your vote.
And, yes, I don't have trust in this. The reason is that from a 2100 square foot home, which is clearly smaller than this one, the price has decreased. It's gone down. Do we believe that in general between a 2100 square foot home and a 2300 square foot home the price should go down? Do we actually believe that in a scatter plot like this the functional relationship between size and cost goes like this? Or do we instead believe it should go like this? The deviations from that linear graph are what's called "noise." Now, noise might not be the best term, but it's the term that statisticians use. It might be that one of them has a great view. The next one is an old house. This one has coastal access, which makes it more expensive. This one really requires a new kitchen. There might be factors that really affect the house price beyond the size. But if those factors are not included, to a statistician that's called "random noise." But coming back to my original question, I think we don't believe the red curve. Let's talk about bar charts as one way to alleviate the problem.
In a bar chart, we take our raw data and pool it together. For example, we might say all the data that falls into this interval over here should be summarized by a single value. Such a value would lie halfway in between these two data points and form what's called a bar. Similarly, we might pool together this data over here into a single bar and so on. Let me ask you a question. What is the height in terms of the dollar figure for the very first bar?
The answer is 80,000. It's the halfway point between 88,000 and 72,000, which is 80,000.
And the answer is 90,000, which is just the halfway point between 94,000 and 86,000, and these are the two data points that fall under the second bar.
I'm sure you get it by now, so give me the number for the third bar.
These are the two points that fall into the third bar. The mean value here is 105,000.
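Pooling points into a bar amounts to averaging the prices that fall into each interval. A short Python sketch, using the two pairs of prices quoted for the first and second bars (the two individual prices behind the third bar are not quoted in these notes, so it is left out here). The dictionary keys are just labels made up for the example.

```python
# Prices of the houses that fall under each bar, as quoted in the lecture.
bars = {
    "first bar": [88_000, 72_000],
    "second bar": [94_000, 86_000],
}

# The height of a bar is the mean of the prices pooled into it.
bar_heights = {name: sum(prices) / len(prices) for name, prices in bars.items()}

print(bar_heights)
```

With only two points per bar the mean is simply the halfway point, but the same line of code handles bars that pool many more points.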
If you look at the bar graph, what you'll find is it's a much finer representation of the data. By pooling together multiple data points into an individual bar, you can see that there is a much better way to really understand the dependence of cost on size. While the bar doesn't give you the linear relationship--in fact, in this case, it happens to be nonlinear--it really gives you a sense that as you go up in house sizes the cost increases, which wasn't obvious from looking at the individual data points. What the bar chart does is it really helps you to pool together groups of data into a single bar and understand global trends. Such global trends might not be that important if you only have six data points, but imagine you have 60,000. With 60,000 data points, your scatter plot may look like this. I can tell you my hand isn't really able to draw 60,000 points. If you go to look at this data set, the individual data tells you very little. Jumping in the x parameter by a tiny bit might make a jump in y from here to here, down to here, to here, and at some point down to here. Yet a bar graph can really help you understand the data. Clearly, one of the things that a statistician does is to use cumulative tools, such as bar graphs, to gain an understanding of the underlying data. Let me ask you. Are bar charts cool? Just check one of the two answers.
Honestly, if you check no, perhaps this class isn't for you. Perhaps you don't share my excitement about looking at data and using simple tools like bar charts to really understand what's going on. But if you checked yes, you're on your way to become a statistician.
I want to talk briefly about histograms as a special case of a bar chart. The key difference is that whereas the bar charts we discussed so far were defined over 2D data, histograms look at 1D data. That is, there is only one dimension of data that is being plotted. Let me start with an example. Here is a fictitious data set about annual income. Suppose at some company I asked software engineers how much annual income they make. Again, this data set is contrived. Of the nine people I asked, here is the survey of different annual salaries. In the histogram case, I make a bar chart that concerns itself with only one thing, which is called "frequency," which is short for "count," and that will group these salaries into three different buckets--from $120,000 on, $130,000 on, and $140,000 on. What the bar chart plots is the frequency at which the people asked fall into the different categories. Specifically, I am asking you what is the count for the salaries that fall into the $120,000 to $130,000 bucket. Please answer it here.
The answer is 5, because all the salaries marked here fall between $120,000 and $130,000. So the bar in the histogram plot for this interval would be 5 high.
Give me the same number for the next interval.
Yes, the answer is 2. There are exactly 2 elements that fall between $130,000 and $140,000. Obviously the final bar is of size 2 again, because there are exactly 2 elements. Now, this histogram differs from the bar chart in that the vertical axis is just a frequency count whereas before it might have been a median home sales value. In 1-dimensional data sets that are numerical, this could be informative. You can say, for example, the majority of workers in this company are in this salary bracket whereas a much smaller number are in higher salary brackets.
A famous histogram can be obtained by looking at the age distribution. For the USA, the distribution looks about as follows. Again, going to the statistics is an endless number of actual ages and so on. Let's make a histogram that is somewhat simplified that only looks at people from age 0 to 40. Here is the data set from 21, 17, 9, 27, and so on. I'm asking you for all these four graphs in a histogram how high are the bars for each of those four ranges as shown over here. Again, the horizontal axis depicts age, and the vertical axis the count. Please, enter your answers here.
From 0 to 10, we have 5 individuals. From 11 to 20 it's 7. From 21 to 30, it's 1, 2, 3, 4, 5. And finally from 31 to 40, it is the remaining 5 data points. In this example, our chart would look like this. This is a histogram that depicts the count in this data set as the function of the range.
In this unit you learned about bar charts and histograms. They both use vertical bars, and they both aggregate data. The big difference was that the bar chart is defined over 2D data. The one dimension applies to the x axis and the other to the y axis whereas histograms only apply to 1D data where the y axis becomes the count of that data. In the next unit we'll encounter another plot. Without giving it away, it is somehow related to this birthday pie. So stay tuned.
Let's now talk about pie charts. You all know what a pie is. Here's a birthday pie, and if you look at this pie from above it looks just like this. It's a circle. Now here comes Sebastian and cuts out his first piece of the pie. What results is a pie with a missing piece, and then my wife comes, and she gets just a small piece, but my brother eats a very big piece, so only the following pie is left. We've just made a pie chart, and in statistics you use pie charts to visualize data, specifically relative data, and I'll tell you in a second what that means.
Let's start with an exercise. Suppose we are in an election, and there are two parties--party A and party B. And suppose it's a toss up, so both parties are getting the same number of votes--or 50%. Here are three pie charts. Will any of those reflect the outcome of the election? Perhaps the first, the second, or the third, or none of the above. Please check exactly one box.
The answer is the second is actually a good representation of this outcome. When I color the pieces of the pie, we'll see that 50% of the pie falls into one class and the other 50% in the other class. Compare this to this pie over here, which seems to be a 75 to 25 split, or the one over here, which is 25 to 75. This is the one I would've chosen.
Now, I said that pie charts are good for relative data. To illustrate this, suppose party A got 724,000 votes and party B got 181,000 votes. What is the percentage of votes that party A got? What's the percentage of votes for party B? Enter your numbers here.
Well, with a little bit of math, we realize that this is 80%. The way to calculate this is 724,000 is the votes that party A received, but the total number of votes was 724,000 plus 181,000, which is 905,000. So party A's share is 724,000 over 905,000. That's exactly 0.8, which is the same as 80%. It follows that party B got 20%. You get to that number when you replace 724,000 with 181,000--the number of party B. It turns out 181,000 divided by 905,000 is exactly 0.2, because 905,000 is exactly 5 times as large as 181,000.
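The share calculation can be sketched in a few lines of Python. The dictionary name `votes` is illustrative; the vote counts are the ones from the quiz.

```python
# Absolute vote counts from the quiz.
votes = {"A": 724_000, "B": 181_000}

total_votes = sum(votes.values())

# Relative share of each party: its votes divided by all votes cast.
shares = {party: count / total_votes for party, count in votes.items()}

for party, share in shares.items():
    print(party, f"{share:.0%}")
```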
Now, given all this I will now draw a number of pie charts. I want you to select the one that most closely resembles the distribution over here. Just for clarity, party A is depicted in red and party B in blue. Please select exactly one of those pie charts.
To me the best answer is the last one. The reason being that this area of the pie most closely resembles 20% of the pie. Now, closely related is the first one where we cut out a quarter of the pie. But a quarter is 25%. So this is slightly smaller than a quarter. I think it's the best one to correspond to 20/80 in this pie chart.
Here comes a tricky question. Given that we have a pie chart with distribution 80% and 20%, I'm now changing the total number of voters. I'm telling you there were 23,000 people voting for party B, and I'm asking you how many voted for party A such that the pie chart over here is exactly the correct one with an 80 to 20 distribution.
The answer is 92,000. You can see this because 80 is exactly 4 times as much as 20. If you took 23,000 and divided it by 20% and multiplied the result by 80%, that is the same as just multiplying 23,000 by 4. You get that number over here. What's remarkable about this chart is it's invariant to the total number of votes. What it really depicts is the relative number of votes. It shows that A got many, many more votes than B. It shows it graphically, so you can see this without even studying the numbers. Let's practice this one more time.
Suppose you're taking a Udacity class. And I guess you're taking one right now, so let's strike the "suppose". Among the students that take the class with you, you find the following age distribution. From 13 to 19 there are 12,000 students. From 20 to 32 there are 96,000 students. And from 33 on there are 36,000 students. I now want to construct the pie chart with you. Here's my pie. I want you to place the separator for the very first class of the age 13 to 19, which is the blue class. Please check the box on the perimeter of the pie chart that best places the separator. For example if you check the box over here,
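Going from one party's count back to the other's, given fixed percentages, can be sketched as follows. The percentages are kept as whole numbers so the arithmetic stays exact; the variable names are made up for the example.

```python
# Shares fixed by the pie chart, in percent.
percent_a, percent_b = 80, 20

# Known absolute count for party B.
votes_b = 23_000

# One percentage point corresponds to votes_b / percent_b votes,
# so party A's count is that amount times percent_a.
votes_a = votes_b * percent_a // percent_b
total_votes = votes_a + votes_b

print(votes_a, total_votes)
```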
This is not that easily solved. What ratio is 12,000 to the total number of students? Well, 12,000 over the total number of students ends up being the same as 12 over 144, and that's the same as 1/12. The correct answer would have been the check box over here. This area over here corresponds to the age group of 13 to 19.
Moving on to our dominant age group, please check on the perimeter of the box that best represents the separator for the second class.
The answer is the box over here. Some of you might have chosen this box, but I'll tell you in a second why this is the better box. This is the resulting pie chart where the red class is 20-32, and the remaining black class is the age group 33 and older. Now let's do the math. To understand what area the red curve will occupy, we're going to divide 96,000 by the total number of students, which is 96/144. It happens to be the same as 8/12. Now, the 8 mark is the one over here, but I chose the 9 mark. This is the 9 mark--1, 2, 3, 4, 5, 6, 7, 8, 9. The reason is one mark has already been used up by the first class, and now we incrementally add 8 to those to arrive over here. The surface area in the pie shown in red really corresponds to 8/12 of the total surface area. If we now plug in the final class of 36,000 students, which is the same as 36/144 or 3/12, you'll find this area over here occupies exactly 3/12 of the total area. Hence this is the correct pie chart.
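The placement of the separators can be reproduced with exact fractions. This sketch assumes, as the twelfths in the explanation suggest, that the perimeter of the pie carries 12 equally spaced marks; the constant `MARKS` and the other names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

# Students per age group, in the order the slices are drawn.
students = {"13-19": 12_000, "20-32": 96_000, "33+": 36_000}
total_students = sum(students.values())

# Assumption: the pie's perimeter is divided into 12 equal marks.
MARKS = 12

# Width of each slice, measured in marks (1, 8 and 3 twelfths).
slice_marks = {
    group: Fraction(count, total_students) * MARKS
    for group, count in students.items()
}

# Each separator sits at the running total of the slice widths so far.
separators = list(accumulate(slice_marks.values()))

print(slice_marks)
print(separators)
```

The running total is why the second separator lands on mark 9 rather than mark 8: the first slice has already used up one mark.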
Let's do this again, using once again our election example. This time with four parties--A, B, C, and D. Here are the election outcomes--party A received 175,000 votes, party B 50,000, party C 25,000, party D 50,000. In most democracies, you don't find such a distribution where one party takes the vast majority, but say that's the case for our country. If I now draw a pie chart, let us assume that we try to graph party A first, then B, C, and D--as indicated over here. Please check exactly those boxes that define the separator from one party to the next.
What we'll find is that party A got 7/12 of the vote, which is the majority, whereas the other ones only received 2/12, 1/12, and 2/12. So if we go forward 7 pieces in this diagram--1, 2, 3, 4, 5, 6, 7--we check mark this box. Another 2 gives us this box. Another 1 gives us this box. These are the final two. Here is how this chart will look. Obviously A takes the majority outcome, B reaps this slice over here, C is the smallest party, and D is the one over here.
As a final question, I will now tell you that in a different election where the same pie chart is correct we had a total of 240,000 voters, which is the sum of all votes cast. Assuming that this pie chart here is correct, can you tell me how many votes are cast for each of the parties?
The answer lies in the numbers I just wiped out. If A got 7/12 in total, we know that 1/12 is 20,000. So A got 140,000, which is 7 times 20,000. Party B got 2/12 or a 6th, which is 40,000. Party C--a disappointing 20,000. And party D the same as party B. If we look at the diagram, it tell you nothing about the absolute numbers. In fact, I can change the absolute numbers. As long as the relative percentages stay the same, it does however tell you a lot about the distribution of the data. It shows you that A is the dominant party that got more than 50% of the outcome whereas B, C, and D occupies smaller slices with a slice for C being half the size of B or D, respectively. Again, this is called a pie chart. Pie charts are really, really powerful to represent things like election outcomes. In any data set we just care about the relative outcomes and perhaps have more than just 2 classes.
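The same pie chart can describe elections of any size, which a short Python sketch makes concrete: keep each party's share fixed and rescale to the new total. The original counts and the new total of 240,000 come from the two quizzes; the names are illustrative.

```python
# Votes from the first four-party election.
original_votes = {"A": 175_000, "B": 50_000, "C": 25_000, "D": 50_000}
original_total = sum(original_votes.values())

# Total number of votes cast in the second election with the same pie chart.
new_total = 240_000

# Same relative shares, different absolute numbers.
# Integer arithmetic is exact here because every share is a whole number of twelfths.
new_votes = {
    party: count * new_total // original_total
    for party, count in original_votes.items()
}

print(new_votes)
```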
Congratulations. You just learned about pie charts. They're great for relative data, and they're wonderful for comparing which slice of the pie is bigger. In the next class we'll look at relative data again, and we'll pick up the touchy issue of gender discrimination in college admissions, using a study originally performed at UC Berkeley here in California. This'll be a really deep statistical question, and I promise you you'll be surprised by the result.
Now comes the moment of truth. I have been waiting for this moment for quite a while to let you experience first hand how to visualize data, and in this unit, you will do everything entirely by yourself. I will give you data and you will plot those data and answer simple questions. But there is one thing to know--this unit is strictly optional. The reason is there is programming involved in this unit and programming was not a prerequisite. However, I will make sure what you have to do is extremely simple, is very well explained, and a whole lot of fun, so if you choose to take this unit, you will visualize some interesting data using a very simple programming tool. This over here is the Udacity programming interface. Don't freak out--all I'll do is give instructions over here, and I'll hit the run button. As I scroll down, you can see a histogram of the data. So this has been computed automatically now. There are five data categories from 2.0 to 6.0, and you can see the frequency of the data in the vertical dimension--1, 4, 3, 1, and 1 again. Now this is a familiar histogram. Just look at how easy it was to generate this.
Here are 3 lines of things that the computer has done for me. The very first one you should just ignore. It'll always be there. Just ignore it. It tells the computer that we wish to plot things. The second and the third one are the important ones. I define a data set--a data set is a list of 10 elements--3, 4, 2, 4, 3, 5, 3, 6, 4, 3 that I made up, and with this line, I tell the computer that this over here is my data. Then I tell the computer, "Please histogram plot my data"--and this is the result I just showed you. Here are the 3 things that we have to tell the computer. If you ignore the cryptic "from plotting import *," then what we do is we define the data. We list it, we give it a name called data, and we generate a histogram plot of the data. When we then hit run in the small box down here, you'll get exactly the plot that I've shown you, where the frequency of the data is plotted accordingly. Obviously, the range from 2.8 to 3.6 is the most frequent. If we go back up to the data, we find 4 occurrences of 3, all of which fall into this range. This is the most frequent range.
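To see where the frequencies 1, 4, 3, 1, and 1 come from, here is a minimal sketch in plain Python that counts the same ten values into five equal-width ranges. It does not use the course's `plotting` module; the helper `bin_counts` is a made-up name, and the assumption that `histplot` splits the data into five equal-width ranges is inferred from the plot described above, not confirmed.

```python
# The ten made-up values from the lecture.
data = [3, 4, 2, 4, 3, 5, 3, 6, 4, 3]

def bin_counts(values, num_bins=5):
    """Count how many values fall into each of num_bins equal-width ranges.

    Hypothetical helper for illustration; assumes equal-width binning.
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        # All values are identical, so everything lands in the first range.
        return [len(values)] + [0] * (num_bins - 1)
    counts = [0] * num_bins
    for v in values:
        # The largest value sits on the upper edge, so clamp it into the last range.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

counts = bin_counts(data)

# Print a rough text histogram, one row per range.
low, high = min(data), max(data)
width = (high - low) / len(counts)
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:.1f}-{start + width:.1f}: " + "#" * count)
```

The second row, the range from 2.8 to 3.6, collects the four occurrences of 3 and is therefore the tallest.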
What I want you to do next is look at a more complex data set. In this data set, we study the height of people in inches from a data set I got from the web. Here is the data set of people and their individual heights--all ranging between about 65 and 72 inches, and I want you to do a histogram plot following the example I've given you. Remember histplot is the command for histogram plot, and I didn't call the data "data." I actually called it height--so try to assemble the right command to plot the data and look at it because I'll ask you a question about it.
If you got this correct, then your histogram will look pretty much like this thing over here. There's a big peak in the middle, and very short people and very tall people are much less likely than medium-sized people. I'm covering up the ranges because I'd like to ask you a question here, but I will tell you that the one command that you had to enter was histplot with height as an argument, and if you play with it, you'll realize you have to match the uppercase letters in the name of the height variable, but you can plug it into histplot just the same way we used data before. Let me ask you a quiz.
For our data set, what is the most frequent height? Is it 63-65 inches, 65-66, 66-68, 68-70, or 70-71? These are the ranges that your histogram is sure to pick.
And the answer is a resounding 66-68 so this is the correct number. Now if you look at these ranges very carefully, they aren't all the same size it seems. This seems to be more like 2 but this range seems to be just 1, and the reason is these numbers are truncated. They actually have decimal points that are not displayed but regardless, this category wins hands down in the statistic.
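The effect of truncated labels can be sketched as follows. The smallest and largest heights used here are hypothetical values chosen for illustration (the actual data set is not reproduced in these notes), and equal-width ranges are an assumption about how the histogram is built.

```python
# Hypothetical smallest and largest heights in inches; NOT taken from the course data.
low, high = 63.4, 71.9
num_bins = 5
width = (high - low) / num_bins  # every range is equally wide (about 1.7 inches)

# Exact range boundaries, including their decimal parts.
edges = [low + i * width for i in range(num_bins + 1)]

# Dropping the decimals makes equally wide ranges look uneven.
labels = [f"{int(a)}-{int(b)}" for a, b in zip(edges, edges[1:])]
print(labels)
```

With these assumed endpoints, every range is equally wide, yet the truncated labels suggest widths of 2, 1, 2, 2, and 1.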
Let's test this again, and this time we study the weight of people. I'll give you a data set of weights, ask you to do the histogram, and have you answer a question about the most common weight category. Here's a data set called weight that you can look at, and I want you to do a histogram for the weight that will look as follows. Please go ahead and tell the computer to make this graph and hit the run button.
As before, the command is histplot with the variable of weight as its argument. This generates the plot I told you, and I'm going to ask you a question about the plot.
What is the most frequent weight in this histogram--or is it Sebastian's weight, which is a mystery? Please check exactly one box.
The answer, if you program histplot the way I did, is 119-130 pounds. This is the correct answer, and no I won't tell you what my weight is at this point. But I can tell you it's much more than 153.
Let's now look at the combined data set of height and weight. It turns out both lists have the same number of elements, and the very first element in the height list belongs to the same person as the very first element in the weight list over here. The same is true for the second--so these data pair up conveniently to plot the height of a specific person and his or her weight. I now want you to run a scatter plot. Remember what a scatter plot looks like. In particular, I want you to graph the height versus the weight, and the command to use is called scatterplot and accepts these two variables as input. I'd like you to try this out for the data I've given you.
All you have to do is to enter the scatterplot command the way I've given it to you, but let me now ask a question about the plot that you've seen.
Is this data exactly linear without any deviation from linearity, approximately linear--that is, can you see a linear trend--or at the other extreme, are height and weight pretty much unrelated--that is, you can't really see a dependence between height and weight? I know there are other possibilities here, but I just want to narrow it down to these three possibilities to derive a very crisp and clear answer.
To answer this question, let's look at the scatter plot--you see there are certain sizes of people and then there are certain weights associated with these people, and it's hard to make out, but I think there's a dependence whereby the taller a person is, the heavier the person is. Now, there are lots of exceptions--this person is remarkably heavy for their size, this is a remarkably light person for about the same size, and we all know that some people are heavier or thinner than others of the same size--but if you look closely, you can see a slight upward trend that appears to be approximately linear. We can't tell this for sure right now--we don't yet understand how to analyze whether the dependence is linear--but you get the gist of it. By looking at the data alone, you can tell that weight increases with size. So for now, I'll check the second box over here because the other ones are obviously incorrect.
As a final exercise, I'd like you to replace scatter plot by bar chart again with the same two arguments as before--so please go ahead and tell the computer to generate a bar chart.
Here's the command--bar chart of height and weight--and as I scroll down, I will find my bar chart, which shows pretty much how larger sizes result in larger weights. Now, this chart calls into question whether the relationship is exactly linear. You can see the distance over here to be significantly smaller than the distance over here, but it's all approximate, and in reality, I can tell you the relationship between height and weight is not linear. But for the sake of the exercise, approximately linear is a better statement than any of the other statements that I've given you.
Let's now look at the data set that relates age to wage--how much money you make. Most of you out there are young, so your preference might be that the world pays young people the best, but in many countries wage goes up with age until perhaps you hit retirement age. Which one is it? Let's plot the data. Here is a very elaborate data set of people of certain ages and the wages that they receive. I want you to make a scatter plot for this data. You'll realize that the highest wage is $267,000, and I want you to read off the scatter plot what is the youngest age in this data set where this wage is being realized.
Here's the scatter plot command. And without looking at the result, I now ask you about the question. What is the youngest age at which a person earns $267,000?
How old is the youngest person to earn $267,000 in our data set? Look at your scatter plot and you can read it straight off. As a hint, because it is the largest wage, it'll touch the upper boundary of the diagram.
The answer is 30--as I go to the scatter plot that graphs age versus income, you'll find that up here is the very first time someone had the maximum wage. That specific number occurs 4 times--here, here, here, and here--among the older crowd of 35-pluses who made it. My own age isn't even on the diagram anymore. But the correct answer would have been 30, and it's really easy to see from the scatter plot.
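Reading this off the plot amounts to pairing each age with its wage and looking for the smallest age among the top earners. The sketch below shows that idea in plain Python. The two short lists are hypothetical toy values made up for this illustration, not the course data set, and the names `top_wage` and `youngest` are likewise assumptions.

```python
# Hypothetical toy data, NOT the course data set:
# age[i] and wage[i] belong to the same person.
age = [22, 30, 35, 27, 38, 40, 33]
wage = [41000, 267000, 267000, 58000, 267000, 120000, 95000]

# The highest wage corresponds to the topmost points in a scatter plot.
top_wage = max(wage)

# zip pairs the two lists element by element, just like the points of a scatter plot.
youngest = min(a for a, w in zip(age, wage) if w == top_wage)

print(top_wage, youngest)
```

On the real data the scatter plot does the same job visually: find the topmost row of points and take the leftmost one.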
Now do me a favor and please make a bar chart of the same data. And after you've done it, I'll ask you the same question of whether there's an approximate linear relationship between age and wage.
Here's the command for making a bar chart. Let me ask you the question.
Is the relationship between age and wage exactly linear, approximately linear, or there seems to be no relationship? Exactly one of those boxes is correct.
I would say approximately linear--if you go to the bar chart, you'll find that these bars line up nicely, almost like a line, with the exception that over here it levels off a little bit. Linear seems to be a good guess as to what's happening--wage increases linearly with age. The reason why it's not exact is more obvious from the scatter plot, which shows the raw data. You can see that even at higher ages around 40, there are still individuals who earned relatively little money--so for the data to be exactly linear, these points would have to line up exactly on a straight line, which is not the case in this data. On average, the relationship might be linear, but when it comes to the exact data points, it's clearly not linear.
Finally, I'd like to ask you a question: what age group is most frequent? Remember we have a specific plot to look into this, and I gave you five options--18-22, 22-26, and so on. Enter the appropriate command over here, look at the output, and then after you're done, come back to my question about what age group is most frequent.
The correct plot is, of course, histplot over age. We're going to ignore wage for this question.
Now here's my question--look at your output and answer which of these age groups happens to be the most frequent.
The answer on our data is 35-40. Here is the actual histogram plot and if you look at this, the biggest bar is the bar on the right, which corresponds to the age of 35-40--so this would have been the correct answer.
Congratulations! You've finished something really complex. You've made your own bar charts, your own histograms, and your own scatter plots. Isn't this a lot of fun? Isn't statistics really great? Did you learn that by just looking at data, you can already understand a lot? So the purpose of this unit was to empower you to instruct the computer to plot data, and the reason why I taught you this is because the very first step every statistician takes when given data is to look at the data. Now in the next unit, I will give you something that'll bend your mind. I will give you data that, depending on how you look at it, will get you a very different conclusion. Stay tuned and check out the next unit where I will bend your mind.
In this unit, I want to show you a problem that will illustrate how deep statistics can actually be. Statistics is not just a superficial field. In fact, in this unit I will show you a problem that will blow your mind. I promise you will think about this for a long time to come. So, let's just dive in.
The problem I'd like to tell you about is motivated by an actual study at the University of California, Berkeley, which many years back wanted to know whether its admissions procedure is gender biased. It looked at various admission statistics to understand whether its admissions policies had a preference for a certain gender. And while the numbers I'll be giving you are not exactly the same as those UC Berkeley found, the paradox is indeed the same and is often called "Simpson's Paradox." I'm just giving you a simplified version of the problem. Here is the data. Among male students, we find that from 900 applicants in major A, 450 are admitted. Please tell me what the acceptance rate is in percent.
Obviously, it's 50%.
In a second major B 100 students applied, of which 10 were admitted. What is the acceptance rate?
And the answer is 10%.
The same statistic was run for female students. Again, I made up the data to illustrate the effect. Females tended to apply predominantly for major B, with 900 applications for major B and just 100 for major A. The university accepted 80 out of 100 applications in major A and 180 out of 900 in major B. Please tell me the rate of acceptance in percent for major A for the female student population.
Of course it's 80%--80/100.
Please do the same for the major B in the female population over here.
So, just looking at these numbers for the two different majors, do we believe--in terms of the acceptance rate--that there is a gender bias? Yes or no?
And I would say yes, in part because the acceptance rate is so different for the different student populations, and the numbers are relatively large. So, it doesn't seem just like random deviations. But the thing that will blow your mind is a different question.
Who is being favored--the male students or the female students? And looking at the data alone, it makes sense to say the female students are favored because for both majors, they have a better admission rate than the corresponding male students. But now, let's do the trick. Let's look at the admission statistics independent of the major. So, let's talk about both majors, and I would wonder how many male students applied. And of course, the answer is 1000. How many were admitted?
And the answer is 460.
So, what is the admissions rate for male students across both majors in percent?
And the answer is, of course, 46%. It's 460/1000 x 100%.
Now, do the same for the female student population. We have 1000 applicants, the same number as in the male case, and 260 students admitted. So, what's the percentage rate for admission?
The answer is 26%.
So, across both majors, I'm asking you the same question again now. Who is actually being favored? Males or females?
And surprisingly, when you look at both majors together, you find that males have a much higher admissions rate than females. I'm not making this up. These numbers might be fake, but that specific effect was observed at the University of California at Berkeley many years ago. But when you look at majors individually, then you find in each major individually the acceptance rate for females trumps that of males, both in the first major and the second major. Going from the individual major statistics to the total statistics, we haven't added anything. We just regrouped the data. So how come, when you do this, what looks like an admissions bias in favor of females switches into an admissions bias in favor of males?
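The whole reversal can be reproduced with a short Python sketch that uses only the made-up applicant and admission counts from this unit. The names `admissions`, `rate`, `per_major`, and `overall` are chosen for this illustration and are not part of the course's programming interface.

```python
# (applicants, admitted) per gender and major, using the made-up numbers from the lecture.
admissions = {
    "male": {"A": (900, 450), "B": (100, 10)},
    "female": {"A": (100, 80), "B": (900, 180)},
}

def rate(applied, admitted):
    """Acceptance rate in percent."""
    return 100 * admitted / applied

# Acceptance rate for each gender within each major.
per_major = {
    gender: {major: rate(*counts) for major, counts in majors.items()}
    for gender, majors in admissions.items()
}

# Acceptance rate for each gender across both majors: add up the raw counts first.
overall = {
    gender: rate(
        sum(applied for applied, _ in majors.values()),
        sum(admitted for _, admitted in majors.values()),
    )
    for gender, majors in admissions.items()
}

for gender in admissions:
    print(gender, per_major[gender], "overall:", overall[gender])
```

Within each major the female rate is higher, yet the overall male rate is higher, because most male applicants chose major A, which admits a much larger fraction of its applicants, while most female applicants chose major B.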
I showed you this example to illustrate how ambiguous statistics really is. In choosing how to graph your data, you can majorly impact what people believe to be the case. In fact, a famous saying goes, "I never believe in statistics I didn't doctor myself." I'll let you guess here who this is being attributed to-- Mark Twain, Oscar Wilde, or Winston Churchill. Check the Web, and see who invented this famous quote.
And even though I don't think Winston Churchill invented it, in the course of World War II the Germans tried to associate this quote with him to make him less credible. Be that as it may, the key lesson here is that statistics is deep and often manipulated. One of the things I'd like to teach you is to be skeptical of statistics, of your own results, of other people's results, and to really understand how to turn raw data into decisions or conclusions. I hope that this simple example made you think. Stay tuned as we dive into the basics of statistics, which is probability theory, in the next number of units.