**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**

Contents

- 1 Problem Set 3: Estimators
- 1.1 01 MLE Proof Challenge (Optional)
- 1.2 02 MLE Proof Challenge (Optional) Solution
- 1.3 03 Standardize Data
- 1.4 04 Standardize Data Solution
- 1.5 05 Scale Data
- 1.6 06 Scale Data Solution
- 1.7 07 Scatter Plot Spread
- 1.8 08 Scatter Plot Spread Solution
- 1.9 09 Histogram Averages
- 1.10 10 Histogram Averages Solution
- 1.11 11 Variance
- 1.12 12 Variance Solution
- 1.13 13 Expected Variance
- 1.14 14 Expected Variance Solution
- 1.15 15 Correction Factor
- 1.16 16 Correction Factor Solution
- 1.17 17 Variance Correction
- 1.18 18 Variance Correction Solution
- 1.19 19 Medians
- 1.20 20 Medians Solution
- 1.21 21 Incremental Mean (Optional)
- 1.22 22 Incremental Mean (Optional)
- 1.23 23 Likelihood Challenge (Optional)
- 1.24 24 Likelihood Challenge (Optional)

At first, I want you and me to derive this mathematically. It is challenging, so if you don't like calculus, you can skip over it. So feel free to skip. If you skip, no one is ever going to know, and I promise you the derivation is not essential for using this formula over here. You can use it entirely without understanding why it's correct. But if you stick with me, you get a sense of how statistics really works as a scientific discipline. So it's a glimpse into the beauty of more advanced statistics. So last chance, you're going to skip? Hey, thanks for staying on.

So here's my quiz, and it looks complicated, but let me explain it to you. I've partially derived for you the desired equation over here, which is the maximum likelihood estimator, all the way from the definition of the probability of the data. You might remember the data is comprised of Xi's, where each Xi is 1 for heads and 0 for tails. And of course, p is the desired probability of heads.

I left nine steps of the derivation open, however. These are the nine bars that will turn into boxes very soon, and what you have to do is pick from the 13 choices over here the ones that fit best. There is always a unique answer, and you can only use each of the things on the right side once. So there will be four things left, 13 minus 9, that you've never used. So for example, if you believe that the right answer over here is pN, then you're going to type 6 into the box over here, and pN is now used; you can't write 6 anywhere else.

The way the proof works is by first taking the logarithm of this expression over here, which ends up over here, and then we set the first derivative to zero. So you have to complete the first derivative and plug it in here, and then you have to transform it all the way until you achieve the final result. Good luck.

The answer to the first open question is number 9: (1 - p) to the power 1 - Xi. And this is actually quite remarkable. The reason why I wrote p to the Xi over here is because we already learned that every time we see heads, we multiply in p. So by putting Xi in the exponent, when Xi = 1 we multiply in p, and if Xi = 0 we multiply in 1, which has no effect. This is exactly the inverse. We multiply in 1 - p, the tails probability, whenever we see tails, in which case 1 - Xi will be 1. Otherwise, 1 - Xi is 0, and we just multiply in 1, which has no effect.

The logarithm of this guy over here is number 1, Xi log p. The first derivative of this expression over here is number 7, and it's interesting that it's not number 2. There's a minus sign over here, and the minus sign is inherited from the minus sign inside the logarithm using the chain rule of differentiation. That was really non-trivial.

This one will be number 11. We're multiplying in p and 1 - p. The 1 - p stays. And likewise, this here is number 4. Multiplying this out misses exactly 1 term, -pXi, which is number 3. Now we observe that these -pXi and +pXi cancel each other out, so we get this expression over here = 0, which is number 12. When we now take p to the other side, we realize there are N additions of p, number 6. And finally, we bring N back to the left side to use number 8, 1/N. And we are now done.
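The result of this derivation, that the likelihood is largest when p equals the fraction of heads, can be checked numerically. The following is a minimal sketch, not part of the lecture: the sample `flips` and the helper name `log_likelihood` are made up for illustration, and the search simply tries a grid of candidate values of p.

```python
import math

def log_likelihood(p, flips):
    """Log of the product of p^x * (1-p)^(1-x) over all flips (1 = heads, 0 = tails)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in flips)

# Hypothetical data: 7 heads out of 10 flips.
flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Closed-form maximum likelihood estimate from the derivation: the sample mean.
p_mle = sum(flips) / len(flips)

# Brute-force check: try candidate values strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: log_likelihood(p, flips))

print(p_mle, p_best)
```

Both numbers should agree (up to the grid spacing), which is what the formula p = (1/N) times the sum of the Xi predicts.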

*Note:*

Here's a step by step proof, all together in one place:

We want to find the value of $p$ that maximizes the following expression:

$$P(\text{data}) = \prod_{i=1}^{N} p^{x_i} (1-p)^{1-x_i}$$

where $x_i$ represents the result of each individual flip when flipping a coin $N$ times (from $i = 1$ to $i = N$). It can be equal to either one (Heads) or zero (Tails). $\prod$ means the product of all terms. That's analogous to $\sum$, but instead of adding the terms, we multiply them.

If $x_i = 1$, $p^{x_i}(1-p)^{1-x_i} = p$.

If $x_i = 0$, $p^{x_i}(1-p)^{1-x_i} = 1 - p$.

For example, to calculate the probability of an outcome such as Heads, Tails, Heads, we should multiply the probability of each individual flip: $p \cdot (1-p) \cdot p$. The expression is just a compact way to represent this for a generic number of flips $N$.

Maximizing an expression made of products isn't very convenient, so to make things easier, we do a trick. We apply the log function to the expression. The logarithm of a value grows as the value grows (the greater the value, the greater the log), therefore maximizing the value and maximizing log(value) are the same thing.

Also, the log function has the following nice property:

$$\log(a \cdot b) = \log a + \log b$$

So, by applying the log function to our expression, we are transforming a product of terms into a sum of terms, which is much more convenient to work with.

Another interesting property of logarithms is the following:

$$\log(a^b) = b \log a$$

Therefore:

$$\log P(\text{data}) = \sum_{i=1}^{N} \left[ x_i \log p + (1 - x_i) \log(1-p) \right]$$

Now we need a little calculus. In calculus, we can determine something called "critical points". These critical points can be either a maximum, a minimum or an inflection point:

**Source:** First Derivative Test on eMathHelp

If you look at these points, you will notice that they have one thing in common. The derivatives at these three points are equal to zero. The derivative can be seen as the slope of the line tangent to the graph that passes through the point. For the maximum and minimum the tangent lines are parallel to the horizontal axis of the graph, which means that the slope is equal to zero (therefore the derivative is zero). Using calculus techniques you can also determine whether the critical point is a maximum, a minimum or an inflection point. Teaching calculus is not part of the scope of this class, so I hope you forgive me and trust me that if you take the derivative of the expression and set it equal to zero, you're going to get the maximum. If you want to learn calculus, Khan Academy is a great place to start.

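Now, our function is a function of $p$, since we want to know what probability maximizes the expression:

$$\frac{d}{dp} \sum_{i=1}^{N} \left[ x_i \log p + (1 - x_i) \log(1-p) \right] = 0$$

The derivative of the sum is equal to the sum of the derivatives:

$$\sum_{i=1}^{N} \left[ \frac{d}{dp}\big(x_i \log p\big) + \frac{d}{dp}\big((1 - x_i) \log(1-p)\big) \right] = 0$$

and $x_i$ and $(1 - x_i)$ are not functions of $p$, therefore we can remove them from the derivative:

$$\sum_{i=1}^{N} \left[ x_i \frac{d}{dp}\log p + (1 - x_i) \frac{d}{dp}\log(1-p) \right] = 0$$

Let's calculate the derivatives of $\log p$ and $\log(1-p)$.

From calculus, $\frac{d}{dx}\log x = \frac{1}{x}$, therefore:

$$\frac{d}{dp}\log p = \frac{1}{p}$$

From calculus, $\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$ (that's called "the chain rule"). Let's make $f(u) = \log u$ and $g(p) = 1 - p$:

$$\frac{d}{dp}\log(1-p) = \frac{1}{1-p} \cdot (-1) = -\frac{1}{1-p}$$

Substituting:

$$\sum_{i=1}^{N} \left[ \frac{x_i}{p} - \frac{1 - x_i}{1-p} \right] = 0$$

Multiply both sides by $p(1-p)$:

$$\sum_{i=1}^{N} \left[ (1-p)\,x_i - p\,(1 - x_i) \right] = 0$$

$p$ doesn't depend on $i$, therefore all terms depending on $p$ can be factored out of the summation:

$$(1-p) \sum_{i=1}^{N} x_i - p \sum_{i=1}^{N} (1 - x_i) = 0$$

Expanding:

$$\sum_{i=1}^{N} x_i - p \sum_{i=1}^{N} x_i - p \sum_{i=1}^{N} 1 + p \sum_{i=1}^{N} x_i = 0$$

Simplifying:

$$\sum_{i=1}^{N} x_i - p \sum_{i=1}^{N} 1 = 0$$

$$\sum_{i=1}^{N} 1 = N$$

(That's adding up one $N$ times)

$$p = \frac{1}{N} \sum_{i=1}^{N} x_i$$

(That's the definition of the mean)

So the value of $p$ that maximizes the probability of $N$ flips is $\frac{1}{N} \sum_{i=1}^{N} x_i$, the mean of the flips.

We have a set of data with a mean of 5, a standard deviation of 4, and a variance of 16. Within the data we have a point Xi with a value of 9. What is the standard score of Xi? Please enter your answer here.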

Recall that the standard score is the data minus the mean, divided by the standard deviation. Here that is (9 - 5) / 4, which equals 4 divided by 4, which is 1.
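As a small illustration of the formula just described, the standard score can be written as a one-line function. The function name `standard_score` is an assumption made for this sketch; the numbers are the ones from the question.

```python
def standard_score(x, mean, std):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / std

# Values from the question: mean 5, standard deviation 4, data point 9.
z = standard_score(9, 5, 4)
print(z)
```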

Now let's multiply every single piece of data by 1.5 and get some new data. We'll call them Ys. Can you tell me what is the mean (mu) of the Ys, what is their standard deviation (sigma), what is their variance (sigma squared), what is the value of this point (now Yi), and what is the standard score (z) of Yi? Fill in your answers.

The mean is simply multiplied by 1.5, and that is 7.5.

To see what happens with the standard deviation and the variance, recall that the standard deviation is defined as the square root of the average of the squared differences between each observation and the mean. Since each of these differences is multiplied by 1.5 and is then squared, the term below the square root gets bigger by 1.5 squared. But after we take the square root, it only gets bigger by 1.5, since the square root of 1.5 squared is equal to 1.5. So the standard deviation of the Ys is 6. Since the variance is the square of the standard deviation, it's 36.

Since Yi is just the individual observation Xi multiplied by 1.5, it's 1.5 times 9, which is 13.5. And the z score or standard score remains at 1. To show ourselves this, let's just redo the calculation. 13.5 is the data point, 7.5 is the mean. Their difference is 6. The standard deviation is 6. And 6/6 is 1.
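The effect of scaling can also be checked on a concrete data set. The lecture does not show the underlying data, so the sketch below assumes a hypothetical two-point data set, `[1, 9]`, chosen only because it has mean 5 and standard deviation 4 and contains the point 9.

```python
import math

def mean(data):
    return sum(data) / len(data)

def variance(data):
    """Population variance: average squared difference from the mean."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)

# Hypothetical data with mean 5 and standard deviation 4 (not from the lecture).
xs = [1, 9]
ys = [1.5 * x for x in xs]  # every point multiplied by 1.5

for name, data in (("X", xs), ("Y", ys)):
    m = mean(data)
    var = variance(data)
    std = math.sqrt(var)
    z = (data[-1] - m) / std  # standard score of the last point (9, then 13.5)
    print(name, m, std, var, z)
```

The mean and standard deviation scale by 1.5, the variance by 1.5 squared, and the standard score is unchanged.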

Here we have a scatter plot of the height of adults versus their height as children. What I'd like you to tell me is whether the standard deviation of adult heights or the standard deviation of children's heights is larger. Please check the variable that has the greater standard deviation.

And the answer is adults. To see this, let's look at how the data is spread out along this scatter plot. We can see that from the mean, somewhere right around here, the spread between the high and the low points is something like this, whereas for children's heights the spread looks more like this. Looking at ranges doesn't really tell you what the standard deviation is, but they are both measures of spread, and when there aren't really outliers, such as in this case, they tend to tell you the same thing. And then just looking at the data, it looks pretty clear that it's spread out more in this direction than in this direction.

Consider this histogram displaying counts of numbers from 1 through 11. Note that each bar is centered at the value listed, so you can assume that that's the average of the values in the bar. So you can interpret this to mean there are 3 points with a value of 3, 2 with a value of 4, 1 with a value of 6, 1 with a value of 5, and so on. What I'd like you to tell me is where the mean, median, and the mode are in this data. There's only 1 right answer for each. Check the circle for each of mean, median, and mode.

And for mode the answer is clear. The answer is 3, because 3 has 3 data points and no other value has more than 2.

The median is also 3. Counting up bar by bar, we have 11 data points. The median value will be the 6th data point. It has 5 above and 5 below, and that happens to be a 3.

The mean is a little trickier to calculate. We could probably eyeball it as being a little above 3, since if there weren't these points over here it would be 3, but let's just work it out quickly. 1 + 4 + 9 + 8 + 5 + 6 + 11. And those equal 44. And we had figured before there were 11 data points, so it's 44/11, and that is equal to 4.
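The three answers can be reproduced with Python's standard `statistics` module. The histogram itself is not reproduced in these notes, so the counts below are an assumption reconstructed from the sum 1 + 4 + 9 + 8 + 5 + 6 + 11 in the solution (one 1, two 2s, three 3s, two 4s, one 5, one 6, one 11).

```python
import statistics

# Assumed histogram counts (value -> number of data points),
# reconstructed from the sum used in the solution.
counts = {1: 1, 2: 2, 3: 3, 4: 2, 5: 1, 6: 1, 11: 1}

# Expand the histogram into a sorted list of individual data points.
data = [value for value, count in sorted(counts.items()) for _ in range(count)]

print(len(data))                 # number of data points
print(statistics.mode(data))     # most frequent value
print(statistics.median(data))   # middle (6th) value
print(statistics.mean(data))     # sum divided by count
```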

Consider the case where a number can be either 0 or 1 with 0.5 probability for each. What is its variance? Note that this is the same as figuring out the variance of 2 data points, 1 that's 0 and 1 that's 1.

The answer is 0.25, one quarter. To see this, recall that the variance is simply the sum of the squared differences from the mean divided by n. So in this case the mean is 0.5, and each point differs from it by 0.5. So this squared difference is a quarter, this one is also a quarter, so the sum is 0.5, and then we divide by 2, which gives us 0.25.

Now let's ask what we expect the variance to be when we generate a couple of data points, say by flipping coins. There are 4 possibilities: 0 and 0, 1 and 0, 0 and 1, and 1 and 1. In each case, compute the variance, and then compute the average of the variances, or the expected variance.

In the first case the answer is 0 because they're the same. In the second case we have data identical to the previous question, so the answer is 0.25. In the third case it's the same, 0.25, and in the last case it's 0. The expected variance is the average of these numbers, or 0.125.
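The same enumeration can be done in a few lines of Python. This is only a sketch of the calculation described above; the helper name `variance` is chosen for illustration.

```python
from itertools import product

def variance(data):
    """Variance with the plain 'divide by n' formula used in the lecture."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

# All four equally likely outcomes of two coin flips: (0,0), (0,1), (1,0), (1,1).
outcomes = list(product([0, 1], repeat=2))
variances = [variance(o) for o in outcomes]

# Average of the four sample variances: the expected variance.
expected_variance = sum(variances) / len(variances)

for outcome, var in zip(outcomes, variances):
    print(outcome, var)
print(expected_variance)
```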

By what factor do I need to multiply the expected variance I got using the normal variance formula to get the actual variance? Enter your answer here.

And the answer is, quite simply, 2: 0.25 divided by 0.125.

Now I'd like to ask you if you can find a more general pattern for the correction factor for the variance calculation on a sample. This is a challenging question. The one example you did before may not have been enough for you to figure this out, so feel free to do more examples with different size samples to see if you can figure this out. Choose one of the following as the correction factor. Is it n, n + 1 over n, n over n + 1, n divided by n - 1, n - 1 divided by n, or is it n squared - 1 over n - 1? Recall that n is equal to the number of data points in the sample. Select the right answer.

And the correct answer is n over n - 1. We can see this just from looking at the single example, because of the other options, the only one that also yields 2 is n. That would imply that the correction factor grows linearly with the amount of data. So if we have a million data points, our variance would need to be somehow multiplied by a million to come up with the actual variance. That doesn't correspond well with intuition because, presumably, as we have bigger and bigger samples, we should be getting closer to reality. That's not a true proof, but you can convince yourself by trying this out on some larger sample sizes.
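Trying this out on larger sample sizes is tedious by hand but easy by brute force. The sketch below, which is not part of the lecture, enumerates every possible sequence of n fair coin flips, averages the "divide by n" variances, and compares the true variance (0.25) to that average. It uses exact fractions so the ratio can be compared to n/(n - 1) without rounding; the function names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def variance(data):
    """Plain 'divide by n' variance, computed exactly with fractions."""
    n = len(data)
    m = Fraction(sum(data), n)
    return sum((x - m) ** 2 for x in data) / n

def correction_factor(n):
    """True variance of a fair coin divided by the expected sample variance."""
    outcomes = list(product([0, 1], repeat=n))  # all 2**n equally likely samples
    expected = sum(variance(o) for o in outcomes) / len(outcomes)
    return Fraction(1, 4) / expected

for n in range(2, 7):
    print(n, correction_factor(n), Fraction(n, n - 1))
```

For each sample size tried, the two printed fractions should match, and the factor shrinks toward 1 as n grows.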

*Note:*

These problems about expectation and correction factor are supposed to give students a very interesting insight into statistics. The simple example with two flips doesn't suffice, but you need to start from it to derive the correction factor for $n$ flips, which will give you the insight.

The example lacks context, so let's start with a more practical example. Let's say you have an election poll between two candidates: Mr. Heads and Mr. Tails.

To find the true $\mu$ (that would be the percentage of votes for Mr. Heads) and $\sigma^2$ of this distribution you would need to wait until the elections are over so you can count all votes, but statisticians try to estimate $\mu$ and $\sigma^2$ even while the candidates are still campaigning.

How do they do that? They do sampling, i.e., they choose a sample of size $n$, which is much smaller than the whole population of voters $N$. From this partial data, they can get an estimation of the true $\mu$ and $\sigma^2$ of the distribution.

That's exactly what Adam is doing, but for a very simple case (the size of the sample is two):

```
Flip #1  Flip #2  Mean               Variance
0        0        (0 + 0)/2 = 0      ((0 - 0)^2 + (0 - 0)^2)/2 = 0
0        1        (0 + 1)/2 = 0.5    ((0 - 0.5)^2 + (1 - 0.5)^2)/2 = 0.25
1        0        (1 + 0)/2 = 0.5    ((1 - 0.5)^2 + (0 - 0.5)^2)/2 = 0.25
1        1        (1 + 1)/2 = 1      ((1 - 1)^2 + (1 - 1)^2)/2 = 0
```

Each of these outcomes represents a sample. This table represents all the possible outcomes for a sample of size two. In an election poll you would only sample once to get your estimate of the variance, but the objective of this simple example is to show the relationship between an estimation of $\sigma^2$ (obtained through sampling) and the true $\sigma^2$ of the distribution.

If you average the variances of these outcomes, you will get the following:

$$E[\hat{\sigma}^2] = \frac{0 + 0.25 + 0.25 + 0}{4} = 0.125$$

What is the meaning of this value? If you sample from the distribution and calculate the variance, you **expect** that your estimation will be close to the average of all possible variances, thus **expected** variance.

For a simple example like this one, where the probability of the coin $p = 0.5$ is given, it's easy to calculate the true $\mu$ and $\sigma^2$:

$$\mu = p = 0.5 \qquad \sigma^2 = p(1-p) = 0.25$$

Therefore the ratio between the true $\sigma^2$ and the estimation of $\sigma^2$ is

$$\frac{\sigma^2}{E[\hat{\sigma}^2]} = \frac{0.25}{0.125} = 2$$

What does this mean? It means that if you estimate the variance of the distribution through sampling, you would be wrong on average by a factor of $2$.

Calculating the correction factor for a sample of size $n = 2$ is easy enough, but to get some insight you need to do better than that. You need to find the correction factor for a sample of size $n$.

You might find this proof a bit overwhelming, so you don't need to go through it if you don't want to. Adam didn't even ask for a proof. He made a multiple choice quiz so you could find out that:

$$\text{correction factor} = \frac{n}{n-1}$$

by simply testing some values of $n$. You might do that by finding the correction factor for $n = 3$ in the same fashion you did for $n = 2$ (with a truth table). There is only one option that works for both values of $n$.

The estimation of $\mu$ for a sample of size $n$ is:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

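Let's say my sample has $k$ Heads (and therefore $n - k$ Tails). You can break the summation into two summations, one for Heads and one for Tails as follows:

$$\hat{\mu} = \frac{1}{n} \left( \sum_{\text{Heads}} 1 + \sum_{\text{Tails}} 0 \right) = \frac{k}{n}$$

The estimation of $\sigma^2$ for a sample of size $n$ with $k$ Heads is:

$$\hat{\sigma}_k^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \frac{k}{n} \right)^2$$

Once again, you can break this summation into two summations, one for Heads and one for Tails as follows:

$$\hat{\sigma}_k^2 = \frac{1}{n} \left[ \sum_{\text{Heads}} \left( 1 - \frac{k}{n} \right)^2 + \sum_{\text{Tails}} \left( 0 - \frac{k}{n} \right)^2 \right]$$

Note that no term depends on $i$, therefore:

$$\hat{\sigma}_k^2 = \frac{1}{n} \left[ k \left( 1 - \frac{k}{n} \right)^2 + (n - k) \left( \frac{k}{n} \right)^2 \right]$$

Simplifying:

$$\hat{\sigma}_k^2 = \frac{k (n - k)}{n^2}$$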

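Now you know the variance for a sample of size $n$ with $k$ Heads, which is $\hat{\sigma}_k^2 = \frac{k(n-k)}{n^2}$. To calculate the average of the variances, you need to calculate the variances for each possible value of $k$, from $0$ to $n$.

Remember that you can calculate the average by multiplying each outcome by its probability and then adding them up:

$$E[X] = \sum_{x} x \, P(x)$$

(as it has been done for a binomial distribution in previous units)

Therefore you can calculate the average of the variances as follows:

$$E[\hat{\sigma}^2] = \sum_{k=0}^{n} P(k) \, \hat{\sigma}_k^2$$

The number of outcomes where the number of Heads is $k$ is given by:

$$\binom{n}{k} = \frac{n!}{k! \, (n-k)!}$$

The frequency (or probability) of outcomes where the number of Heads is $k$ is:

$$P(k) = \frac{n!}{k! \, (n-k)!} \, p^k (1-p)^{n-k}$$

Therefore

$$E[\hat{\sigma}^2] = \sum_{k=0}^{n} \frac{n!}{k! \, (n-k)!} \, p^k (1-p)^{n-k} \, \frac{k (n-k)}{n^2}$$

gives you the average of the estimated variances.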

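But for $k = 0$ (all Tails) and $k = n$ (all Heads) the variance is zero, therefore:

$$E[\hat{\sigma}^2] = \sum_{k=1}^{n-1} \frac{n!}{k! \, (n-k)!} \, p^k (1-p)^{n-k} \, \frac{k (n-k)}{n^2}$$

Let's expand some of the terms:

$$n! = n (n-1) (n-2)! \qquad k! = k \, (k-1)! \qquad (n-k)! = (n-k) \, (n-k-1)!$$

$$p^k = p \cdot p^{k-1} \qquad (1-p)^{n-k} = (1-p) \, (1-p)^{n-k-1}$$

Now the equation can be simplified:

$$E[\hat{\sigma}^2] = \sum_{k=1}^{n-1} \frac{(n-1) \, (n-2)!}{n \, (k-1)! \, (n-k-1)!} \, p (1-p) \, p^{k-1} (1-p)^{n-k-1}$$

Rearranging the terms (putting the terms that do not depend on $k$ outside the summation):

$$E[\hat{\sigma}^2] = \frac{n-1}{n} \, p (1-p) \sum_{k=1}^{n-1} \frac{(n-2)!}{(k-1)! \, (n-k-1)!} \, p^{k-1} (1-p)^{n-k-1}$$

Now let's make a simple change of variables in the summation term:

$$k' = k - 1 \qquad n' = n - 2$$

For the summation limits, you can do the following: If $k = 1$, $k' = 0$. If $k = n - 1$, $k' = n - 2 = n'$.

So the limits after the change of variables become: from $k' = 0$ to $k' = n'$.

$$E[\hat{\sigma}^2] = \frac{n-1}{n} \, p (1-p) \sum_{k'=0}^{n'} \frac{n'!}{k'! \, (n'-k')!} \, p^{k'} (1-p)^{n'-k'}$$

The sum of the binomial probabilities from $k' = 0$ to $k' = n'$ for $n'$ samples is $1$ (that's the sum of the probabilities for all possible outcomes and it's always $1$ no matter the value of $n'$). Therefore:

$$E[\hat{\sigma}^2] = \frac{n-1}{n} \, p (1-p)$$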

You know that the true $\sigma^2$ for a single flip of the coin (a binomial distribution with one trial) is given by $\sigma^2 = p(1-p)$. Therefore the correction factor is the ratio between the true variance and the expected estimate, $\frac{n-1}{n} \, p(1-p)$:

$$\frac{\sigma^2}{E[\hat{\sigma}^2]} = \frac{p (1-p)}{\frac{n-1}{n} \, p (1-p)} = \frac{n}{n-1}$$

Testing: If $n = 2$:

$$\frac{n}{n-1} = \frac{2}{2-1} = 2$$

(Exactly what we got when we did the calculation for the simple case).

**Finally, the insight:**

Note that as we increase the value of $n$, $\frac{n}{n-1}$ will become closer and closer to $1$. This means that if we estimate the variance of a distribution by sampling, the greater the size of our sample, the closer our estimation will be to the true variance of the distribution. So, that's why sampling works and why statisticians can make a good estimation of the result of an election poll by sampling (as long as $n$ is big enough).

Now I'd like to ask you to find a median for me, but I'd also like to introduce a little twist. In the lecture, we said that the median is an element in the set, in the middle of the set. So if there's an even number, it can be 1 of 2 things. So I'd like you to give me one of those medians here. We'll call this the in-data type of median. But the problem with it is it's not unique. Many people prefer to use a unique median, and to do that, they just take the simple mean of the acceptable answers for the in-data method. Tell me the unique median over here.

And the answer to this question is fairly straightforward. In data we have 2 possibilities. The median could either be 9 or it could be 15. So we'll accept either of those. In order for it to be unique, we just need to take their average, and that's equal to 12.

Now I'd like to ask you a couple of programming questions. These are completely optional, but if you did the programming in the unit, you might want to try these out. During the unit, you wrote a function that can compute the mean of a set of data. I'd like to now ask you to write something a little different. Write a function, mean, that can take a known mean of existing data (in this case I called it current mean), a count of the amount of data I have, and a new value, and tell me what the mean is now. And so to call my mean function I'll just type in mean(currentmean, currentcount, new). And we can see it gives me the correct answer of 9.0.

Here is the function, mean, that you'll fill in. It takes the old mean, which we called current mean below, the number of observations (n), and the new value to be added (x). Insert your code right below the comment. Have fun.
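Python's standard `statistics` module happens to offer both flavors of median. The data set from the quiz is not shown in these notes, so the list below is a made-up example whose two middle values are 9 and 15.

```python
import statistics

# Hypothetical data set (not from the lecture) whose two middle values are 9 and 15.
data = [2, 9, 15, 30]

low = statistics.median_low(data)     # in-data median, lower of the two middle values
high = statistics.median_high(data)   # in-data median, higher of the two middle values
unique = statistics.median(data)      # mean of the two middle values

print(low, high, unique)
```

With an odd number of data points all three functions return the same middle element.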

```
# In class you wrote a function mean that computed the mean of a set of numbers.
# Consider a case where you have already computed the mean of a set of data and
# get a single additional number. Given the number of observations in the
# existing data, the old mean and the new value, complete the function to return
# the correct mean.
from __future__ import division

def mean(oldmean, n, x):
    # Insert your code here

print(mean(currentmean, currentcount, new))  # Should print 9
```

And here is my answer. There are a number of ways to do this, but a simple one is to take the old mean, multiply it by the number of observations, which gets us the old sum, add to it the new value, which gets us the new sum, and then divide it by the new number of observations, which is just one more than the old one.

```
# In class you wrote a function mean that computed the mean of a set of numbers.
# Consider a case where you have already computed the mean of a set of data and
# get a single additional number. Given the number of observations in the
# existing data, the old mean and the new value, complete the function to return
# the correct mean.
from __future__ import division

def mean(oldmean, n, x):
    # Insert your code here
    # Old sum (oldmean * n) plus the new value, divided by the new count.
    return (oldmean * n + x) / (n + 1)

print(mean(currentmean, currentcount, new))  # Should print 9
```
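One of the other ways to write this update is to nudge the old mean toward the new value, which avoids rebuilding the full sum. The sketch below is an alternative formulation, not the lecturer's answer; the function name `update_mean` and the sample values are made up for illustration. It feeds values in one at a time to keep a running mean.

```python
def update_mean(old_mean, n, x):
    """Mean after adding x to n existing values whose mean is old_mean."""
    # Algebraically the same as (old_mean * n + x) / (n + 1).
    return old_mean + (x - old_mean) / (n + 1)

# Hypothetical stream of values; their overall mean is 45 / 5.
values = [8, 10, 9, 12, 6]

running = 0.0
for count, value in enumerate(values):
    # count is the number of values already included in the running mean.
    running = update_mean(running, count, value)

print(running)
```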

Now I'd like to ask you to do a slightly more challenging programming assignment. Imagine we have a die that has an arbitrary number of sides, and each side can be labeled anything we want. And the die can be fair or it can be loaded. I called that here dist. You can think of it as a distribution of values or, more formally, a probability distribution, but that doesn't really matter for this exercise. In the lecture we learned about likelihood in the context of coins. Here I'd like to ask you about it in the context of a die that can have any number of sides, labeled anything, with any probability. And I'd like you to write a function called likelihood that takes the distribution and the data and returns the likelihood.

I wrote one. Let's see what it produces. You can see we get a very, very small number. That's actually quite common with likelihoods: because there are so many different possible data sequences, any given one likely has a pretty small probability. That said, they can still be interesting.

Now let's look at the code you're going to have to write. Here is the function, likelihood, that takes a distribution and data, just as we saw before. Insert your code here. Have fun.

```
# Compute the likelihood of observing a sequence of die rolls.
# Likelihood is the probability of getting the specific set of rolls
# in the given order.
# Given a multi-sided die whose labels and probabilities are
# given by a Python dictionary called dist and a sequence (list, tuple, string)
# of rolls called data, complete the function likelihood.
# Note that an element of a dictionary can be retrieved by dist[key] where
# key is one of the dictionary's keys (e.g. 'A', 'Good').
def likelihood(dist, data):
    # Insert your answer here

tests = [
    (({'A': 0.2, 'B': 0.2, 'C': 0.2, 'D': 0.2, 'E': 0.2}, 'ABCEDDECAB'), 1.024e-07),
    (({'Good': 0.6, 'Bad': 0.2, 'Indifferent': 0.2},
      ['Good', 'Bad', 'Indifferent', 'Good', 'Good', 'Bad']), 0.001728),
    (({'Z': 0.6, 'X': 0.333, 'Y': 0.067}, 'ZXYYZXYXYZY'), 1.07686302456e-08),
    (({'Z': 0.6, 'X': 0.233, 'Y': 0.067, 'W': 0.1}, 'WXYZYZZZZW'), 8.133206112e-07),
]
for t, l in tests:
    if abs(likelihood(*t) / l - 1) < 0.01:
        print('Correct')
    else:
        print('Incorrect')
```

And here is my answer. To get the likelihood of a set of data, we just need to multiply in the probability of seeing every single data point. So we start at 1, since that's the multiplicative identity. For every data element i we go through and multiply by the probability of element i, and then we return the result. If you're interested in a real challenge, note this function can actually be written on just 1 line. So for anyone watching who is a real Python enthusiast, feel free to post your 1-liners to the forums. Have fun.

```
# Compute the likelihood of observing a sequence of die rolls.
# Likelihood is the probability of getting the specific set of rolls
# in the given order.
# Given a multi-sided die whose labels and probabilities are
# given by a Python dictionary called dist and a sequence (list, tuple, string)
# of rolls called data, complete the function likelihood.
# Note that an element of a dictionary can be retrieved by dist[key] where
# key is one of the dictionary's keys (e.g. 'A', 'Good').
def likelihood(dist, data):
    # Insert your answer here
    l = 1  # start at the multiplicative identity
    for i in data:
        l *= dist[i]  # multiply in the probability of each observed roll
    return l

tests = [
    (({'A': 0.2, 'B': 0.2, 'C': 0.2, 'D': 0.2, 'E': 0.2}, 'ABCEDDECAB'), 1.024e-07),
    (({'Good': 0.6, 'Bad': 0.2, 'Indifferent': 0.2},
      ['Good', 'Bad', 'Indifferent', 'Good', 'Good', 'Bad']), 0.001728),
    (({'Z': 0.6, 'X': 0.333, 'Y': 0.067}, 'ZXYYZXYXYZY'), 1.07686302456e-08),
    (({'Z': 0.6, 'X': 0.233, 'Y': 0.067, 'W': 0.1}, 'WXYZYZZZZW'), 8.133206112e-07),
]
for t, l in tests:
    if abs(likelihood(*t) / l - 1) < 0.01:
        print('Correct')
    else:
        print('Incorrect')
```
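For the 1-line challenge, here is one possible sketch. It assumes Python 3.8 or later, where `math.prod` is available, so it is not a drop-in for the Python 2 style code above. Because likelihoods get very small very quickly, the sketch also shows a log-likelihood variant (an addition not covered in the lecture) that sums logarithms instead of multiplying probabilities; the name `log_likelihood` is chosen for illustration.

```python
import math

def likelihood(dist, data):
    """One-line version: product of the probability of every observed roll."""
    return math.prod(dist[roll] for roll in data)

def log_likelihood(dist, data):
    """Sum of log-probabilities; stays in a comfortable range for long sequences."""
    return sum(math.log(dist[roll]) for roll in data)

# Test cases taken from the exercise above.
fair = {'A': 0.2, 'B': 0.2, 'C': 0.2, 'D': 0.2, 'E': 0.2}
mood = {'Good': 0.6, 'Bad': 0.2, 'Indifferent': 0.2}
rolls = ['Good', 'Bad', 'Indifferent', 'Good', 'Good', 'Bad']

print(likelihood(fair, 'ABCEDDECAB'))
print(likelihood(mood, rolls))
print(log_likelihood(mood, rolls))
```

Exponentiating the log-likelihood recovers the likelihood, up to floating-point rounding.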