17.  Outliers

These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!

01 Ignore Data

Which of the following people routinely ignore data? Politicians, leaders of cults, statisticians, or none of the above? Check any or all that apply.

02 Ignore Data Solution

We know about politicians, we know about certain cult leaders, but did you know about statisticians? Did you know that statisticians routinely ignore data?

03 Should You Ignore

Here is an example. In your favorite sports club you get a list of the members, and it might look as follows. You are given a name and an age in years. If you look at this, is it prudent to ignore some of the data when, for example, computing the mean age?

04 Should You Ignore Solution

And the answer is yes. There seems to be a 211-year-old Tom. Unless you are writing to the Guinness Book of Records about having found the oldest living person, this is likely just a typo.

05 Compute Mean

If asked, what do you think is an appropriate mean?

06 Compute Mean Solution

And yes, the answer is 22, which is the mean of the remaining numbers. What's really happening is that each club member has a real age. There is a relationship between the real age and the age in the database, but sometimes the age in the database gets corrupted. It can be something as simple as a typo, and perhaps there's a certain chance of a typo, maybe 10%. Then any number in the database can be explained either by the real age or by a typo. If you have a sequence that reads 20, 21, 22, 23, 24 and 211, I would submit that, given what we know about real ages, the 211 over here is the result of a typo. Therefore we should not consider it in the calculation of the mean.
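
To make the effect of the typo concrete, here is a minimal sketch using the sequence quoted above. The plausibility cutoff `MAX_PLAUSIBLE_AGE` is a hypothetical value chosen only for this illustration; the lecture identifies the typo by judgment, not by a fixed threshold.

```python
from statistics import mean

# Ages as read from the database; 211 is the suspected typo.
ages = [20, 21, 22, 23, 24, 211]

# Hypothetical cutoff, assumed only for this illustration.
MAX_PLAUSIBLE_AGE = 120

# Keep only the entries that could be a real age.
plausible = [age for age in ages if age <= MAX_PLAUSIBLE_AGE]

mean_all = mean(ages)         # the typo drags the mean above every real age
mean_clean = mean(plausible)  # mean of the remaining five ages

print(mean_all)    # 53.5
print(mean_clean)  # 22
```

A single corrupted entry moves the mean from 22 to 53.5, which is why dropping it matters.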

07 Quartiles

The easiest way to ignore outliers uses quartiles or their closely related cousin, percentiles. Suppose we have a data set of the following form, and here the data is shown in order; there are exactly 11 items. Quartiles partition this data into four regions and, look carefully, there are gaps in between. The element in the center is one you've encountered before. You might remember what it is: the mean, the median, or the mode? Please check one of those.

09 Finding Quartiles

These two elements over here are called the lower quartile and, guess what, the upper quartile. The range in between, the upper quartile minus the lower one, is called the interquartile range. The data in this range is what we use to calculate things such as the mean, and data outside this range is ignored. This is a simple but often very effective outlier removal technique: it removes extremes that are often attributable to things other than what you're trying to understand. Obviously this works well because there are exactly 11 items in our data. If there were only 10 items, you would have to shift ??? and you'd break a little bit of the symmetry, but that's not a big deal because most data sets are very large. Now, assuming it works for 11, give me the next number up that also works.

10 Finding Quartiles Solution

And that would be 15, and from there it continues with 19 and 23. The formula is: we have four quarters, so 4 times n, plus three separating elements: the lower quartile, the median and the upper quartile. Any number that satisfies this formula has an exact definition of quartiles; for anything else, you just pick something in between. It really shouldn't matter much in most cases.
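
As a quick check of the formula, the sketch below tests whether a data size has the form 4 times n plus 3 and searches for the next size that does. The function names are made up for this illustration, and requiring n to be at least 1 is an assumption (it matches the note that follows, where the smallest chunk has size 1).

```python
def has_exact_quartiles(size):
    """True if size equals 4*n + 3 for some whole number n >= 1."""
    return size >= 7 and size % 4 == 3


def next_exact_size(size):
    """Smallest data size greater than `size` that has exact quartiles."""
    candidate = size + 1
    while not has_exact_quartiles(candidate):
        candidate += 1
    return candidate


print(has_exact_quartiles(11))  # True
print(has_exact_quartiles(10))  # False
print(next_exact_size(11))      # 15
```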

Note:
The whole idea of quartiles is to break your data into four "chunks" of equal size. These "chunks" are demarcated by three data points: the lower quartile, the median and the upper quartile:

    |----------------|  LQ   |----------------|  MD   |----------------|  UQ   |----------------|

Let's compute the size of the data, which is composed of the following data points:

• The lower quartile (LQ), the median (MD) and the upper quartile (UQ). That's three data points.
• The "chunks" of data demarcated by LQ, MD and UQ. All you know about them is that they are the same size. Let's say that the size of each chunk is N. That's 4 x N data points (4 "chunks", each one of size N).

    <------- N ------><- 1 -><------- N ------><- 1 -><------- N ------><- 1 -><------- N ------>
    |----------------|  LQ   |----------------|  MD   |----------------|  UQ   |----------------|

Therefore the size of the data is (4 x N) + 3 (the sum of all data points).

The smallest possible value of each "chunk" is 1. If we increase the size of our "chunks" in increments of one we will get the following table:

    "CHUNK" SIZE       DATA SIZE
    1                  (4 x 1) + 3 = 7
    2                  (4 x 2) + 3 = 11
    3                  (4 x 3) + 3 = 15
    4                  (4 x 4) + 3 = 19
    5 ...              ...
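
The same layout also tells you where the three separating elements sit in the sorted data. The sketch below is one way to express that, counting positions from 1; the function name is hypothetical and not part of the lecture.

```python
def quartile_positions(chunk_size):
    """1-based positions of LQ, MD and UQ in sorted data of size 4*chunk_size + 3."""
    lq = chunk_size + 1          # one chunk, then the lower quartile
    md = 2 * chunk_size + 2      # two chunks plus LQ, then the median
    uq = 3 * chunk_size + 3      # three chunks plus LQ and MD, then the upper quartile
    return lq, md, uq


for chunk_size in range(1, 5):
    data_size = 4 * chunk_size + 3
    print(chunk_size, data_size, quartile_positions(chunk_size))

# For chunk size 2 (11 items) the positions are (3, 6, 9).
```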


11 Compute Quartiles

Here's our age distribution again, and the ages come at certain frequencies. Two of the people in the database are 19 years old, one is 20, one is 21, three are 22 and so on; there are 11 individuals. Now I ask you the obvious question: give me the lower quartile, the median and the upper quartile in these three boxes over here.

12 Compute Quartiles Solution

And this is most easily seen by writing out all the data explicitly. Let me do this here: there are two of age 19, one each of age 20 and 21, three of age 22, two of age 23, and one each of age 24 and 25, and now we can count exactly 11 items. This is where you'll find the lower quartile, the median and the upper quartile. That's 20, 22 and 23.
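
The counting above can be sketched in a few lines. The frequency table is taken from the notes; the index arithmetic assumes the data size has the form 4 times n plus 3, as it does here.

```python
# Age -> number of members with that age, as listed in the notes.
age_counts = {19: 2, 20: 1, 21: 1, 22: 3, 23: 2, 24: 1, 25: 1}

# Expand the frequency table into one sorted list with an entry per member.
ages = sorted(age for age, count in age_counts.items() for _ in range(count))

size = len(ages)
assert size % 4 == 3, "exact quartiles need a size of the form 4*n + 3"
n = (size - 3) // 4  # size of each of the four chunks

# 0-based indices of the three separating elements.
lower_quartile = ages[n]
median = ages[2 * n + 1]
upper_quartile = ages[3 * n + 2]

print(size)                                    # 11
print(lower_quartile, median, upper_quartile)  # 20 22 23
```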

13 Trimmed Mean

Now, assuming you use the method I just told you about and compute the mean after outlier removal, what do you get?

14 Trimmed Mean Solution

Obviously, these are the numbers with which to compute the mean: the seven values from the lower quartile through the upper quartile. With my calculator, I get a total sum of 153, which divided by 7 gives me about 21.86. Here it is.
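
Here is a small sketch of that calculation on the 11 ages from the previous segment. It keeps the quartile elements themselves in the average, which is the reading that reproduces the sum of 153 over 7 values.

```python
ages = [19, 19, 20, 21, 22, 22, 22, 23, 23, 24, 25]  # sorted, 11 items

n = (len(ages) - 3) // 4      # chunk size, 2 for 11 items
kept = ages[n : 3 * n + 3]    # lower quartile through upper quartile, inclusive

total = sum(kept)
trimmed_mean = total / len(kept)

print(kept)                    # [20, 21, 22, 22, 22, 23, 23]
print(total, len(kept))        # 153 7
print(round(trimmed_mean, 2))  # 21.86
```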

15 Compute Mean 2

Come in. I'm making a class. >> Very nice. >> All right, now you know the answer, Tom. 1489. >> Okay. So what's the mean of those numbers, if you just use all these numbers? With me just now is Tom Mitchell, professor at Carnegie Mellon University, and Tom says 1000. What do you say?

16 Compute Mean 2 Solution

The answer is about 300, or 290.6 to be precise. I actually cheated. I used my calculator over here.

17 Trimmed Mean 2

What's interesting now is what happens when I apply the quartile removal method. After cleaning up and removing the points outside the interquartile range, what do you think the new mean will be?

18 Trimmed Mean 2 Solution

Professor Mitchell, what do you think? >> You throw away the extremes, the mean is 20. >> Okay, so you throw away the extremes. You arrange all points as before: -99, 13, 17, 33, 1489. We find that these are the lower quartile and the upper quartile, and this happens to be the median. These are the numbers that are averaged: 13, 17 and 33. You have one more chance, Tom, to revise your estimate. >> Okay. I'll go with 21. >> 21! You did it. Professor Mitchell from Carnegie Mellon University, with us today, computed the quartile of this data sequence. Thanks so much! >> Thank you. Thank you.
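
Both numbers from this exchange can be checked with a short sketch. With only five items the quartiles are not exact, so the code simply follows the transcript: 13 and 33 are treated as the lower and upper quartile and stay in the average.

```python
values = sorted([-99, 13, 17, 33, 1489])

# Mean of all five numbers, extremes included.
mean_all = sum(values) / len(values)

# Drop the smallest and largest value; 13, 17 and 33 remain.
kept = values[1:-1]
mean_trimmed = sum(kept) / len(kept)

print(mean_all)      # 290.6
print(mean_trimmed)  # 21.0
```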

19 Percentile

Let's talk about percentiles. A percentile is kind of the same thing. The kth percentile is, so to speak, k% of the way through the data. There are many ways to split. The one I'd adopt is, let's say for the 10th percentile, this is 10% and this is 90%. The cut we apply on one side could also be applied on both sides. For our age example, say we remove the upper 20th percentile, so we remove some of the data items here. Check the ones that we remove.

20 Percentile Solution

The answer is this guy. This data item is exactly 20% of the data, at the upper end of the range.
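
A minimal sketch of cutting off the upper k percent follows. The list for the age example is not reproduced in these notes, so the sketch borrows the five-number sequence from the Trimmed Mean 2 segment. The function name is hypothetical, and rounding the number of removed items down is an assumption; other conventions exist.

```python
def drop_upper_percent(data, percent):
    """Return the sorted data with the top `percent` percent of items removed."""
    ordered = sorted(data)
    # Number of items to cut, rounded down (an assumed convention).
    cut = len(ordered) * percent // 100
    return ordered[: len(ordered) - cut]


values = [-99, 13, 17, 33, 1489]
print(drop_upper_percent(values, 20))  # [-99, 13, 17, 33]
```

With five items, one item is exactly 20% of the data, so only the largest value is removed.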

21 Statisticians Ignore

Statisticians ignore data, and they do it with pride. So, do statisticians lie? Yes, no, or maybe?

22 Statisticians Ignore Solution

Well, that's kind of a fun question that I didn't mean seriously. Based on what I've shown you here, I wouldn't call this a lie. But can I vouch that no statistician ever lied? Of course not. I'm sure some of them do. Let's move on to the next unit.