
23.  Sebastian's Weight and Proofs (Optional)




01 Find Weight

Selection_001.png

In this unit, we'll have some fun. In the last couple of days, a discussion on Facebook brought up the question of what Sebastian's weight is, and I decided that rather than telling people how much I weigh, I'd turn this into a statistics exercise. I want you to put together everything we've done so far using programming. Since programming has been optional in this class, consider this unit optional, but it'd be great if you had a chance to try it. It's not that hard, and at the end of the day you'll know something about me that I rarely discuss in public. Through a comment I made in class, a discussion erupted in our Facebook STATS 101 discussion group about what my actual weight is. Here is the form that I posted.

Selection_002.png

People were asked to submit their best estimate of how much I weigh in kilograms, and also how much they thought I weighed a year ago. Within a few hours, there was a good number of guesses, including this one over here, which is about as much as the planet Pluto weighs, and also some negative guesses.

Selection_003.png

These are both the negative weight of Pluto. But other than that, there were lots of really good guesses. You can see, in kilograms, some people think I weigh 80 or 65, others think I weigh 250. I took the guesses and added them into a large list called weight; there are just below 100 of them, and now I want to do statistics on them. The very first thing I did was print the mean estimate, and it turns out to be negative: -2.10×10²⁰. This is a typical situation in statistics: when you look at those numbers, most of them are actually pretty good guesses, but the extreme guesses of 10²² or -10²² over here completely skew the actual statistics.

Now, you've learned how to deal with this; you know everything about statistics. What I want you to do is code a function called calculate_weight that does 3 things, and I think you can do all three of them yourself. First, I want you to remove the outliers by only extracting the data between the lower and upper quartile. All the test cases we'll be using have a number of data points for which the lower and upper quartile are well-defined elements. Then, I want you to fit a Gaussian using the maximum likelihood estimator. From there, I want you to compute the value x that corresponds to the standard score z, so I'll be giving you not just the weight data but also a way to find my exact weight. If you plug in a standard score of -2, which is two standard deviations below the mean that we will estimate, you'll find the exact weight that I measured this morning. It's amazingly accurate. The data you guys provided was overestimating my weight, and I'm happy to report, by two standard deviations. All these formulas are known, and I think you have all the coding skills necessary from past units to fill these gaps. Obviously, the first step is the hardest, and when you're done with it, this command over here will give you the correct answer.

import random
from math import sqrt

def mean(data):
    return sum(data)/len(data)

def variance(data):
    mu=mean(data)
    return sum([(x-mu)**2 for x in data])/len(data)

def stddev(data):
    return sqrt(variance(data))



weight=[80.,85,200,85,69,65,68,66,85,72,85,82,65,105,75,80,
    70,74,72,70,80,60,80,75,80,78,63,88.65,90,89,91,1.00E+22,
    75,75,90,80,75,-1.00E+22,-1.00E+22,-1.00E+22,86.54,67,70,92,70,76,81,93,
    70,85,75,76,79,89,80,73.6,80,80,120,80,70,110,65,80,
    250,80,85,81,80,85,80,90,85,85,82,83,80,160,75,75,
    80,85,90,80,89,70,90,100,70,80,77,95,120,250,60]

print mean(weight)


def calculate_weight(data, z):
    # remove outliers
    # extract data between lower and upper quartile
    # fit Gaussian using MLE
    # compute x that corresponds to standard score z
    pass  # fill in your implementation here


print calculate_weight(weight, -2.)

02 Find Weight Solution

I won't give you a number, but here is my solution. The hardest part is getting the outlier removal correct. For that, I sorted the data and computed the lower and upper quartile, and these are the exact formulas I used. It turns out that even if the division doesn't factor nicely, those give me good numbers. Then I compute a new data set where I just extract the data in the range from the lower quartile to the upper quartile, using this simple command over here. There are many different ways to implement this. Then, the maximum likelihood estimator is very simple: you've already developed the functions mean and standard deviation, and I just use those; they were supplied to you in the code we gave you. Finally, this is the inverse of the standard score z that I told you about. If the standard score z is this expression over here, then you can easily solve it for x by bringing x to the left side. With this formula we get this kind of code over here, and we return the corresponding x value. Now when you hit "run" you get the answer to the fascinating question of how much I weigh, but I won't tell you.
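In formula form: the standard score is z = \frac{x - \mu}{\sigma}, so solving for x gives x = \mu + z \cdot \sigma, which is exactly what the last step of the solution below computes.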

import random
from math import sqrt

def mean(data):
    return sum(data)/len(data)

def variance(data):
    mu=mean(data)
    return sum([(x-mu)**2 for x in data])/len(data)

def stddev(data):
    return sqrt(variance(data))



weight=[80.,85,200,85,69,65,68,66,85,72,85,82,65,105,75,80,
    70,74,72,70,80,60,80,75,80,78,63,88.65,90,89,91,1.00E+22,
    75,75,90,80,75,-1.00E+22,-1.00E+22,-1.00E+22,86.54,67,70,92,70,76,81,93,
    70,85,75,76,79,89,80,73.6,80,80,120,80,70,110,65,80,
    250,80,85,81,80,85,80,90,85,85,82,83,80,160,75,75,
    80,85,90,80,89,70,90,100,70,80,77,95,120,250,60]

print mean(weight)


def calculate_weight(data, z):
    # remove outliers
    # extract data between lower and upper quartile
    data.sort()
    lowerq = (len(data) - 3) / 4   # index of the lower quartile in the sorted data
    upperq = lowerq * 3 + 3        # one past the index of the upper quartile
    newdata = [data[i] for i in range(lowerq, upperq)]

    # fit Gaussian using MLE
    mu = mean(newdata)
    sigma = stddev(newdata)

    # compute x that corresponds to standard score z
    x = mu + z * sigma
    return x

print calculate_weight(weight, -2.)
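As a quick worked check of those index formulas (using the 95 guesses in the list above): lowerq = (95 - 3)/4 = 23 and upperq = 23 * 3 + 3 = 72, so newdata keeps the 49 sorted values at indices 23 through 71, roughly the middle half of the guesses, and the huge positive and negative outliers are discarded before the Gaussian is fit.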

03 Weight Conclusion

There is all kinds of other cool stuff you can do with the data. This is a cleaned-up data set where I took the extreme values out manually, and then I made a scatterplot of today's weight versus last year's weight. When I do this, I get an interesting graph. It is approximately linear, but there are some very funny outliers. There's a point over here: obviously someone thought I went from 0 to 200 kg within a year. There are also points over here that indicate massive weight loss or massive weight gain within the span of a year, and I have no clue why people believe I massively lost or gained weight. But as a statistician, looking at the data is really informative. It tells you that even in the spectrum of good guesses, there are some that are suspicious, or people likely didn't take the question very seriously. As a statistician, your skill will be to look at these data points more deeply and possibly remove them based on what you see in this scatterplot. That goes way beyond the idea of quartiles, but in the estimation of the mean, those phenomena have a massive impact on what your final outcome looks like. Thank you, everybody, for supplying the data that led to this programming unit.

Selection_004.png
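For readers who want to reproduce a plot like the one above, here is a minimal matplotlib sketch. It is not the instructor's code, and the two lists are hypothetical stand-ins for the cleaned pairs of guesses (today's weight and last year's weight), which are not reproduced in these notes.

import matplotlib.pyplot as plt

# hypothetical paired guesses; the real cleaned data set is not included in these notes
weight_now = [80., 85., 69., 75., 90., 82., 77., 200.]       # guessed weight today (kg)
weight_last_year = [82., 84., 70., 74., 95., 80., 150., 0.]  # guessed weight a year ago (kg)

plt.scatter(weight_last_year, weight_now)
plt.xlabel("guessed weight one year ago (kg)")
plt.ylabel("guessed weight today (kg)")
plt.title("Today's vs. last year's weight guesses")
plt.show()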

04 Mean MLE

It's proof time again! It's 3 AM in the recording studio. As always, this is entirely optional; it goes way beyond what will ever be taught in an introduction to statistics class, and it's only for those who love hard proof challenges. As previously, we're going to prove the correctness of the maximum likelihood estimator, or MLE, but this time for Gaussians. Our X_i's can now be arbitrary values, not just 0s and 1s, and the mean and the variance are computed using these two now familiar formulas. How can we prove this to be correct? Again, last chance to skip.

As before, there's the proof on the left side with a number of blanks, five in total, and there are seven expressions on the right that you can put in. The way you fill them in is: if, for example, the seventh expression in your opinion fits over here, you put the number 7 in there. You only use each expression on the right side once, and there are obviously more expressions on the right side than holes on the left side. Let me quickly go over the proof. In maximizing the probability of the data, we first state the probability of these data items over here, using a normal distribution with parameters µ and σ², but that leaves something unknown on the left that you have to pick from the right side. Then we take the logarithm of this, and here's the logarithm with two gaps in it. We take the first derivative of the logarithm, which you have to compute by picking the appropriate term on the right, and set it to 0, which is where the maximum is attained. From there, we can transform the expression into this expression over here, where I left out a term, and it finally gives you the desired proof for the maximum likelihood estimator of the mean. Good luck!

unnamed.jpg

05 Mean MLE Solution

The first thing you have to figure out is that this is actually a product; it's #5. Because you have multiple data points, the probabilities multiply. Going to the logarithm, we replace the product with a sum, which is #6. The logarithm of the exponential expression is just its argument, which means #3 is the solution here; you simply copy that argument over. The first derivative loses all of these terms over here, since they don't depend on µ, but it changes this one over here. The answer ends up being #4, (X_i - µ)/σ², and the reason is the way the derivative works. When we take the derivative of this expression, there's a factor of 2 and a minus sign in front of the µ; those get multiplied in and cancel against the -1/2. Now, from here to here, I just multiplied by σ², and then I bring the µ out of the sum, and the resulting expression is Nµ. The N comes in because we have N additions of µ in the formula over here. That completes the entire proof.


Note:
Here are all the steps of the proof together in one place:

The operator \prod_{i=1}^N denotes the product of the terms for i running from 1 to N. Therefore:

P(X_1, .., X_N) = \prod_{i=1}^N \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]}

P(X_1, .., X_N) = \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_1 - \mu)^2}{\sigma^2}]} \times \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_2 - \mu)^2}{\sigma^2}]} \times \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_3 - \mu)^2}{\sigma^2}]} ... \times \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_N - \mu)^2}{\sigma^2}]}

Because of the following property from logarithms:

log(a \times b) = log(a) + log(b) (Property 1)

You can do the following:

log(P(X_1, .., X_N)) = log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_1 - \mu)^2}{\sigma^2}]} ) + log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_2 - \mu)^2}{\sigma^2}]}) + log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_3 - \mu)^2}{\sigma^2}]}) ... + log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_N - \mu)^2}{\sigma^2}]})

That's the same as:

log(P(X_1, .., X_N)) = \sum_{i=1}^N log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}} \cdot e^{[-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]} )

Using property 1 once again for the term inside the summation operator:

log(P(X_1, .., X_N)) = \sum_{i=1}^N log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}}) + log(e^{[-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]})

From another logarithm property:

log(e^a) = a (Property 2)

This works because the natural log function is the inverse of the exponential function. Therefore:

log(P(X_1, .., X_N)) = \sum_{i=1}^N log(\frac{1}{\sqrt{2 \pi \cdot \sigma^2}}) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

You know that \frac{1}{\sqrt{2 \pi \cdot \sigma^2}} = (2 \pi \cdot \sigma^2)^{-\frac{1}{2}}, therefore:

log(P(X_1, .., X_N)) = \sum_{i=1}^N log((2 \pi \cdot \sigma^2)^{-\frac{1}{2}}) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Because of the following logarithm property:

log(a^b) = b \cdot log(a) (Property 3)

You can do the following:

log(P(X_1, .., X_N)) = \sum_{i=1}^N \frac{-1}{2} \cdot log(2 \pi \cdot \sigma^2) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Using property 1 again:

log(P(X_1, .., X_N)) = \sum_{i=1}^N \frac{-1}{2} \cdot (log(2 \pi) + log(\sigma^2)) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Using property 3 again:

log(P(X_1, .., X_N)) = \sum_{i=1}^N \frac{-1}{2} \cdot (log(2 \pi) + (2 \cdot log(\sigma))) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Simplifying:

log(P(X_1, .., X_N)) = \sum_{i=1}^N \frac{-1}{2} \cdot log(2 \pi) - log(\sigma) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Taking the partial derivative with respect to \mu and setting it equal to zero to find the maximum:

\frac{\partial(\sum_{i=1}^N \frac{-1}{2} \cdot log(2 \pi) - log(\sigma) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}])}{\partial \mu} = 0

The only term that depends on \mu is [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}], so the derivatives of the other terms are equal to zero.

\frac{\partial [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]}{\partial \mu} = \frac{(X_i-\mu)}{\sigma^2}

Therefore:

\sum_{i=1}^N \frac{(X_i-\mu)}{\sigma^2} = 0

Multiplying both sides by \sigma^2:

\sum_{i=1}^N (X_i-\mu) = 0

\sum_{i=1}^N X_i - \sum_{i=1}^N \mu= 0

\sum_{i=1}^N X_i - N \cdot \mu= 0

\sum_{i=1}^N X_i = N \cdot \mu

\mu = \frac{1}{N} \cdot \sum_{i=1}^N X_i

Therefore \frac{1}{N} \cdot \sum_{i=1}^N X_i is the value of \mu that maximizes the probability (MLE or Maximum Likelihood Estimator).
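As a quick numerical sanity check of this result (not part of the course material), the short Python sketch below evaluates the Gaussian log-likelihood of a small made-up data set at the sample mean and at a few nearby values of µ; the sample mean gives the largest value, as the proof predicts.

from math import log, pi, sqrt

def log_likelihood(data, mu, sigma):
    # sum of log N(x | mu, sigma^2) over the data, matching the formula above
    return sum(-0.5 * log(2 * pi * sigma ** 2)
               - 0.5 * (x - mu) ** 2 / sigma ** 2 for x in data)

data = [80., 75., 72., 90., 68., 85., 79.]   # hypothetical example data
mle_mu = sum(data) / len(data)               # sample mean (the claimed MLE)
sigma = sqrt(sum((x - mle_mu) ** 2 for x in data) / len(data))

for mu in [mle_mu - 2., mle_mu - 1., mle_mu, mle_mu + 1., mle_mu + 2.]:
    print("mu=%.3f  log-likelihood=%.4f" % (mu, log_likelihood(data, mu, sigma)))
# the largest log-likelihood is printed for mu equal to the sample mean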


06 Variance MLE

Now we've talked about the easy thing, which is the mean. We're now going to move on to a much more challenging proof, which is for σ². I once again start with the probability of the data, which is the very familiar product expression over here. Let me lay out the entire proof for you. As before, we start with the probability of the data, and as before we take the logarithm of it. Since I've already done this, I'll just restate what we previously derived; this should look familiar, and if not, go back and watch the previous video. Now the key thing is that we take the derivative with respect to sigma. So here's how it looks: we take the derivative of the whole expression, this time with respect to sigma, to arrive at this thing over here. We simplify this by multiplying something in here, factoring something out over here, and we arrive at our desired result. Plug in those numbers as before; there's a total of 10 expressions on the right for seven open slots on the left.

unnamed (1).jpg

07 Variance MLE Solution

Let's start with the first expression. This one over here doesn't depend on sigma at all, hence its derivative is 0. The derivative of log σ is 1/σ. This expression is funky because you have a σ² in the denominator; differentiating it gives a 1/σ³, and the -2 that gets multiplied in cancels the -1/2 over here very conveniently, so we're left with all these sigmas in the denominator. We multiply through by σ³, which gives a σ² over here and leaves only this expression on the right side. Taking the σ² out of the sum on the left side, we observe that there are N of them, so we get Nσ², and bringing the N to the right side gives us σ². This is the full proof that the empirical, observed variance of the data is the maximum likelihood estimator for the variance parameter of a Gaussian. If you got this far, I am impressed; we left basic statistics way behind us and got into fairly heavy math. Thanks for taking this unit. If you just clicked forward and hit the next button, never mind; you can completely understand basic statistics without understanding how it's derived. Let's move on and use the Gaussian for something practical.

Selection_005.png


Note:
From the MLE proof for \mu ("Mean MLE"), you know that:

log(P(X_1, .., X_N)) = \sum_{i=1}^N \frac{-1}{2} \cdot log(2 \pi) - log(\sigma) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]

Taking the partial derivative with respect to \sigma and setting it equal to zero to find the maximum:

\frac{\partial(\sum_{i=1}^N \frac{-1}{2} \cdot log(2 \pi) - log(\sigma) + [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}])}{\partial \sigma} = 0

The only terms that depend on \sigma are - log(\sigma) and [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}], so the derivatives of the other terms are equal to zero.

\frac{\partial (-log(\sigma))}{\partial \sigma} = \frac{(-1)}{\sigma}

\frac{\partial [-\frac{1}{2} \cdot \frac{(X_i - \mu)^2}{\sigma^2}]}{\partial \sigma} = \frac{(X_i-\mu)^2}{\sigma^3}

Therefore:

\sum_{i=1}^N (\frac{-1}{\sigma} + \frac{(X_i-\mu)^2}{\sigma^3}) = 0

Multiplying both sides by \sigma^3:

\sum_{i=1}^N (-\sigma^2 + (X_i-\mu)^2) = 0

\sum_{i=1}^N (-\sigma^2) + \sum_{i=1}^N (X_i-\mu)^2 = 0

-N \cdot \sigma^2 + \sum_{i=1}^N (X_i-\mu)^2 = 0

N \cdot \sigma^2 = \sum_{i=1}^N (X_i-\mu)^2

\sigma^2 = \frac{1}{N} \cdot \sum_{i=1}^N (X_i-\mu)^2

Therefore \frac{1}{N} \cdot \sum_{i=1}^N (X_i-\mu)^2 is the value of \sigma^2 that maximizes the probability (MLE or Maximum Likelihood Estimator).
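Analogously, here is a small numerical check (again, not from the course) that the empirical variance maximizes the likelihood with respect to σ: scaling the MLE σ up or down lowers the log-likelihood.

from math import log, pi, sqrt

def log_likelihood(data, mu, sigma):
    # sum of log N(x | mu, sigma^2) over the data
    return sum(-0.5 * log(2 * pi * sigma ** 2)
               - 0.5 * (x - mu) ** 2 / sigma ** 2 for x in data)

data = [80., 75., 72., 90., 68., 85., 79.]   # hypothetical example data
mu = sum(data) / len(data)
mle_sigma = sqrt(sum((x - mu) ** 2 for x in data) / len(data))

for factor in [0.5, 0.8, 1.0, 1.25, 2.0]:
    sigma = mle_sigma * factor
    print("sigma=%.3f  log-likelihood=%.4f" % (sigma, log_likelihood(data, mu, sigma)))
# the log-likelihood peaks at factor 1.0, i.e. at the empirical standard deviation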

You may have noticed that the instructor's answer in the quiz above is incorrect: the last line should be \sigma^2 (the variance), not \sigma (the standard deviation).