These are draft notes extracted from subtitles.
Now you will learn about the three Ms in statistics: the mean, the median, and the mode. These are terms that are really important to know. They are useful in looking at data, and they are often confused. So let me bring clarity to the story of the three Ms. Let's talk about the first. We talked quite a bit about house prices, and here is a list of possible house prices. The mean uses the exact same formula as in the previous unit. It's the average of all those prices over here. Do me a favor and quickly compute it for those numbers over here. When you put in the number, don't worry about the thousands. It's something between 165 and 190.
When you add those up, you get 870,000. This is the total sum of all the house prices. We have to divide by the number of houses, N = 5. 870 divided by 5 is 174.
That's the mean. Why is this useful? The prices I just gave you are really the type of house prices you find in the region over here, which is a small section of Pittsburgh, Pennsylvania, where I lived for many years. You will see they go between $165,000 and $190,000. By computing the mean of $174,000, you can really characterize this section of this neighborhood of Pittsburgh. When we look over here, we find most house prices are actually a little bit cheaper (125, 148, 110, 160), but there are two outliers. One is 325K, but the big one is the one over here: 2.4 million. So do me a favor and compute the mean for this section of town over here. And remember, 2.4 million is the same as 2,400K.
And the answer is approximately 492.6K.
Aren't we suspicious of this number? Most of the homes here are in the $100K range. There is one of $325K and one of $2.4 million. The mean doesn't really reflect this. It makes you feel that the average home in the neighborhood is about $492,000. But that's not a good description of what's happening. Instead, we have a situation where we can graph the price and the frequency, and we have a peak around $130K. But then there is a really, really long tail that gets us all the way to $2.4 million. And the sad part is that rather than finding the peak over here, we find ourselves somewhere in the middle. The reason is that this one number of $2.4 million drags us to the right side and really has a strong impact on the mean, which ends up being just below $500K. This is exactly where we get the second M, the so-called median. The median is a different statistic that seeks to find the typical house price in this list: the one in the middle. Put differently, if you sorted those, the median picks the one in the middle. I'm sure you can do this in your head, so please enter the median house price over here.
To solve this, let us sort the house prices in increasing order: 110K, 125, 148, 160, 180, 325, all the way up to 2400. That makes it obvious that the one in the middle is 160K. So $160,000 is the median house price. If you look at this neighborhood again, I would say the 160K is a really good characterization of the typical house price you find in this neighborhood. It's much, much, much more reasonable than the nearly 500K that the mean would provide us.
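The contrast between the two statistics can be checked in a few lines of Python. This is only a minimal sketch using the standard-library `statistics` module; the variable names are made up for illustration, and the seven prices (in thousands of dollars) are the ones from the sorted list above.

```python
import statistics

# House prices in thousands of dollars, taken from the sorted list in the notes.
prices_k = [110, 125, 148, 160, 180, 325, 2400]

mean_k = statistics.mean(prices_k)      # sum of the prices divided by their count
median_k = statistics.median(prices_k)  # middle value of the sorted prices

print(f"mean:   {mean_k:.1f}K")
print(f"median: {median_k}K")

# Dropping the 2.4 million outlier shows how strongly it pulls the mean.
without_outlier = [p for p in prices_k if p != 2400]
mean_without_outlier_k = statistics.mean(without_outlier)
print(f"mean without the outlier: {mean_without_outlier_k:.1f}K")
```

The mean comes out near 492.6K while the median is 160K. Removing the single 2.4 million home brings the mean down to roughly 174.7K, which illustrates how much one outlier moves it.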
So we've learned about two of the three Ms: the mean, the median, but not the third, so let's focus on the third, the mode. I will use as an example a children's birthday party. If you've ever been to one of those, there are usually kids and some of their parents. This one happens to have 5 kids and 4 parents, or 9 people in total. Let me write down the ages of the kids and their parents: 4, 3, 32, 33, 4 again, 32 again, 3, 38, and 4. So one of the things we can compute for this is the mean age of all the people at the party.
As I'm sure you figured out, the mean is 17. If you sum all these things up, you get 153, which divided by 9 gives us 17.
Now, obviously, the mean isn't particularly informative. There are really no teenagers at this party. There are either toddlers or small children or parents. This is one of these cases where the mean is giving you bad statistics. So let's move on to the median. What's the median?
To find this out, we sort the list. Here are the ages of the kids, and here are the ones of the parents. Now we see the median is 4. Is this a useful statistic? Not really.
Suppose another two parents show up, aged 35 and 36. What would the median be now?
And the answer is that if we add 35 and 36 to the sequence, our median shifts here, and it becomes 32.
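To see the jump concretely, here is a small sketch that computes the mean and the median of the party ages before and after the two extra parents arrive. It uses Python's standard `statistics` module; the variable names are just illustrative.

```python
import statistics

# Ages at the party: 5 kids and 4 parents.
ages = [4, 3, 32, 33, 4, 32, 3, 38, 4]

# Two more parents show up.
ages_with_late_parents = ages + [35, 36]

mean_before = statistics.mean(ages)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_late_parents)

print("mean before:  ", mean_before)
print("sorted before:", sorted(ages), "-> median", median_before)
print("sorted after: ", sorted(ages_with_late_parents), "-> median", median_after)
```

With 9 people the middle of the sorted list is the fifth entry, a kid's age. With 11 people it is the sixth entry, which is already a parent's age.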
So with a small modification of the people at the party, our median shifted from 4 to 32. That's kind of odd. The mode is the age that's most frequently represented at the party. We already encountered modes when we looked at bar charts and talked about this a little bit to find the most frequent bar. Here, because our ages are discrete, you can give me the most frequent age at this party.
The answer is 4. It is irrespective of whether these two parents show up or not. Four is the most frequent age at the party.
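Counting how often each age occurs makes this easy to verify. The following is a minimal sketch using `collections.Counter` from the standard library; the helper `most_frequent` is a hypothetical name introduced only for this example.

```python
from collections import Counter

ages = [4, 3, 32, 33, 4, 32, 3, 38, 4]
late_parents = [35, 36]


def most_frequent(values):
    """Return (value, count) for the value that occurs most often."""
    return Counter(values).most_common(1)[0]


mode_before = most_frequent(ages)
mode_after = most_frequent(ages + late_parents)

print("before the extra parents:", mode_before)
print("after the extra parents: ", mode_after)
```

In both cases the age 4 wins with three occurrences, so the mode does not move when the two parents are added.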
So the mode is a useful statistic if your distribution of data is what is called multimodal. In multimodal distributions, you have a curve that has multiple bumps. This one is called bimodal because it has exactly 2 bumps. The mode is the value at the highest point of the largest bump, which now occurs here at the value of 4. So in a situation like this, the mean is usually irrelevant, as we showed with our house prices example. It falls just between the data. The median may be in the middle or flipped from left to right and really doesn't say much either. The mode, too, can flip from left to right, but at the very least it corresponds to something that is very frequent in the data. So for bimodal data, or for multimodal data, which is data with even more than two bumps, the mode is something you should really calculate. So let's practice. Let me give you a list of numbers: 5, 9, 100, 9, 97, 6, 9, 98, and 9. Compute for me the three Ms: mean, median, and mode.
For the mean, we add up those numbers. The sum is 342, which divided by 9, because there are 9 numbers, gives us 38. For the median, we re-sort them, and we find that 9 is the number in the middle. And for the mode, it's easy to see that there are 4 occurrences of 9. Every other number only occurs once. Obviously, sometimes there could be ties. There could be 2 elements in the center or 2 numbers that are tied for the mode, and we just assume that ties are broken at random. So pick either number; it doesn't really matter.
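The same exercise can be worked through in code. This is a sketch only, built on Python's standard `statistics` module; the function name `three_ms` is made up for this example.

```python
import statistics

numbers = [5, 9, 100, 9, 97, 6, 9, 98, 9]


def three_ms(values):
    """Return the mean, median, and mode of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


mean, median, mode = three_ms(numbers)

print("sorted:", sorted(numbers))
print("mean:  ", mean)
print("median:", median)
print("mode:  ", mode)
```

Note how the three large values (97, 98, 100) pull the mean up to 38, while the median and the mode both stay at 9.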
Here is one final sequence: 3, 9, 3, 8, 2, 9, 1, 9, 2, and 4. Give me once again the mean, the median, and the mode.
If we sum all of those up, we get 50, so the mean is 5. The mean is always precisely determined. The median is a bit more difficult to compute. If you sort the numbers, you get this sequence here. Both the 3 and the 4 sit in the center, so I accept both as valid answers, either 3 or 4. The mode in this case happens to be 9. But realize that had there been one fewer 9, there would have been other candidates for the mode as well. So, with one fewer 9, the values 2, 3, and 9 would have all been valid answers for the mode.
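Ties like these are where library functions have to pick a convention, so it can help to look at them explicitly. The sketch below uses Python's standard `statistics` module (`multimode` needs Python 3.8 or later); the variable names are illustrative. Note that `statistics.median` follows the common convention of averaging the two middle values, which differs from accepting either 3 or 4 as done here.

```python
import statistics

sequence = [3, 9, 3, 8, 2, 9, 1, 9, 2, 4]

mean = statistics.mean(sequence)

# With an even count there are two middle values in the sorted list.
low_middle = statistics.median_low(sequence)    # the smaller middle value
high_middle = statistics.median_high(sequence)  # the larger middle value
averaged = statistics.median(sequence)          # average of the two middle values

# multimode returns every value that is tied for most frequent.
modes = statistics.multimode(sequence)

# Take away one 9 and three values are tied for the mode.
fewer_nines = sequence.copy()
fewer_nines.remove(9)  # removes the first 9 only
tied_modes = sorted(statistics.multimode(fewer_nines))

print("mean:", mean)
print("middle values:", low_middle, "and", high_middle, "(averaged:", averaged, ")")
print("mode(s):", modes)
print("mode(s) with one fewer 9:", tied_modes)
```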
This finishes my class on the 3 Ms. In the next one, we'll learn about variance and standard deviation. These are actually fun topics relating to the spread of data. We'll learn the difference between data like this and data like this that is much more spread out, even though the 3 Ms might agree for all of these data. And we're going to sharpen our skills to really understand how to look at data.