These are draft notes extracted from subtitles.
I will now teach you about estimators. You will learn exciting terms like maximum likelihood estimator and Laplacian estimator. What this will empower you to do is derive probabilities from things such as coin flips. This class has the subtitle "To fake or not to fake," and without giving it away, the outcome will surprise you. So let's get started.
Suppose I flip a coin 6 times and get this as an output: 100101, where 1 represents heads and 0 represents tails. The problem I'm addressing is what we should think about this coin. In particular, if someone asks us what this coin's probability of coming up heads is, what would you say?
And obviously, this looks like a fair coin. It comes up heads 3 times and tails 3 times. So how about 0.5?
Different coin. Now flip it 5 times. What do you think the probability of heads should be?
And the most obvious answer is what's called the empirical frequency, where the word empirical refers to observations, what you actually saw. Here 4 of the 5 coin flips gave us heads, so the obvious answer is 0.8.
Let me do a third quiz: this coin was flipped seven times and it came up tails every single time. What do you think we should say now, using the same method?
And you'll probably say 0.
Now look, I used a fixed formula in all of those. If you call the data X₁, X₂, all the way to Xₙ, where n is the number of coin flips, which of the following formulas do you think best captures what we were doing: the sum of the Xᵢ, 1/n times the sum of the Xᵢ, the product of the Xᵢ, or 1/n times the product of the Xᵢ?
And I'm sure you guessed this correctly: we're taking the sum of the outcomes, normalized by the total number of experiments, that is, 1/n times the sum of the Xᵢ. Because each outcome is 0 or 1, the sum cannot exceed n; hence this expression will always be between 0 and 1, which is the valid range for a probability. For reasons that should be obvious in a minute, we call this the maximum likelihood estimator. Keep this formula in mind. It is a really good way to guess an underlying probability that might have produced any given data set. Let's generalize this to more than two outcomes.
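As a quick illustration, the formula can be sketched in a few lines of Python. This is only a sketch: the function name `mle` is a made-up helper, and the two sequences are the first coin (100101) and the third coin (seven tails) from the quizzes.

```python
from fractions import Fraction

def mle(flips):
    """Empirical frequency: (1/n) times the sum of the outcomes (each 0 or 1)."""
    return Fraction(sum(flips), len(flips))

first_coin = [1, 0, 0, 1, 0, 1]   # 100101 from the first quiz
third_coin = [0] * 7              # seven tails from the third quiz

print(float(mle(first_coin)))  # 0.5
print(float(mle(third_coin)))  # 0.0
```

Exact fractions are used here only to keep the arithmetic free of rounding; plain floating-point division would work just as well.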
Here is a die with six possible outcomes. Let's assume you've got the following data sequence for 10 different experiments, and now we're going to apply the same method to compute the probabilities of the six different outcomes. Please enter them in the six boxes over there.
Using the maximum likelihood estimator (MLE) that we just discussed, here are my answers. 1 was observed in 1 out of 10 experiments, which gives us 0.1, that is, 1 divided by 10. 2 was observed twice, which gives 0.2. 3, 4 and 5 were observed once each, giving 0.1 each, and 6 was observed four times (one, two, three, four), which gives 0.4.
Now, these are the most likely settings for the underlying probabilities. Just as a sanity check, let's add them up. Tell me what the sum of those probabilities is.
It's 1. All these probabilities always have to add up to 1, so you can easily check this.
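The die answers and the sum check can be reproduced with a short sketch. The counts are read off the answers above (the original roll sequence is not reproduced in these notes), and `mle_from_counts` is a hypothetical helper name.

```python
from fractions import Fraction

# How often each face appeared in the 10 rolls, per the answers above.
counts = {1: 1, 2: 2, 3: 1, 4: 1, 5: 1, 6: 4}

def mle_from_counts(counts):
    """Divide each count by the total number of experiments."""
    total = sum(counts.values())
    return {face: Fraction(c, total) for face, c in counts.items()}

probs = mle_from_counts(counts)
for face, p in probs.items():
    print(face, float(p))
print("sum:", sum(probs.values()))  # sum: 1
```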
Let's now dive in and understand why we use this cryptic name, maximum likelihood estimator. The estimation problem is: given data, find the probability p, which in our simple example was the probability of heads. The problem we studied in probability runs the other way: given a fixed p, what's the probability of the data? So let's begin with the second and use it to find a good solution for the first, the estimation problem. Again I'm going to ask you a quiz. Suppose my data looks like this. For the different values of p, I want you to calculate the probability of the data. Remember, this is exactly the problem we discussed earlier when we introduced probabilities of multiple independent events.
The answer for the first one, p = 1/2, is 0.125: that is one half times one half times one half, which makes an eighth.
Let's now change p to be 1/3. With the exact same data, what is now the probability of that data?
And the answer is 0.074. Under the model that heads comes up with probability 1/3, the individual flips have probabilities 1/3, 2/3 and 1/3. You multiply them together, and when you do this you get 2/27, which is about 0.074.
Let's keep searching. P = 2/3. What do we get?
We get 0.148. This is the largest number so far. It's the product of 2/3 for the first flip, times 1/3, times 2/3, or 4/27. So it seems the larger the value of p, the better the result.
So let's go to extremes. P = 1. What's the probability of the data now? Give me your best answer.
And bummer, it is 0. These two guys each contribute 1, but the middle guy has a probability of literally 0. So 1 x 0 x 1 is 0.
Let's graph this with p on the horizontal axis and P(data) on the vertical axis. For 1/2 we get a certain value. For 2/3 we get a larger value. For 1/3 you might be down here. And for the extremes of 1 and 0, you end up down here at 0. So we effectively get a curve that, as we will find out, peaks exactly at p = 2/3 and goes to 0 at both extremes. This point, 2/3, is the one that maximizes the likelihood of the data. Therefore it's called the maximum likelihood estimator, or MLE.
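The quiz answers and the shape of that curve can be checked with a small sketch. It assumes the data is heads, tails, heads (1, 0, 1), which is what the products in the answers imply; the function name `likelihood` and the grid of 301 candidate values are illustrative choices, not part of the lecture.

```python
from fractions import Fraction

def likelihood(p, flips):
    """P(data | p) for independent flips: multiply p per head, (1 - p) per tail."""
    result = Fraction(1)
    for x in flips:
        result *= p if x == 1 else 1 - p
    return result

data = [1, 0, 1]  # assumed: heads, tails, heads

# The four quiz values of p: prints 1/8, 2/27, 4/27 and 0.
for p in [Fraction(1, 2), Fraction(1, 3), Fraction(2, 3), Fraction(1)]:
    print(p, likelihood(p, data))

# Coarse grid search over p = 0/300, 1/300, ..., 300/300 for the maximizer.
grid = [Fraction(i, 300) for i in range(301)]
best = max(grid, key=lambda p: likelihood(p, data))
print("best p on the grid:", best)  # 2/3
```

The grid search is a brute-force stand-in for maximizing the likelihood analytically. It lands on 2/3, the same value as the empirical frequency of heads in this data.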
Now let's expose the weakness of the maximum likelihood estimator. Let's be wicked. Suppose I flip the coin exactly once and it comes up heads. What do you think is the probability of heads for this coin under the maximum likelihood estimator?
And the answer is 1. 1 heads over 1 observation. That's 1.
Let's say you flip another coin once and you get 0. That's the complete set of experiments. What do we now think the probability of heads is?
And the answer is 0. 0/1.
Now does this mean that from a single coin flip, the maximum likelihood estimator will always assume a loaded or biased coin, yes or no? In particular, what if I use an unbiased coin like this one? Will the maximum likelihood estimator always be doomed to be wrong?
And very certainly, the answer is yes. From a single coin flip, the maximum likelihood estimator will always assume that all future coin flips come up the same way.
Let me ask you again. Suppose you make 111 coin flips. Will the maximum likelihood estimator always assume that the coin is loaded?
And very certainly, the answer is again yes. An unloaded coin has a probability of 0.5. So in the ratio, the numerator, the number of observed heads, must be exactly half the total number of observations. If the total number of observations is 111, that's just not possible. So the coin will look ever so slightly loaded even if it's a fair coin.
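The argument for 111 flips can be verified by listing every estimate the maximum likelihood estimator could possibly produce. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction

n = 111
# Every value heads/n that the maximum likelihood estimator can output.
possible = [Fraction(heads, n) for heads in range(n + 1)]

print(Fraction(1, 2) in possible)  # False: 111 is odd, so heads/111 is never 1/2

# The nearest the estimate can get to a fair coin.
closest = min(possible, key=lambda q: abs(q - Fraction(1, 2)))
print(closest, float(closest))
```

The closest achievable estimates are 55/111 and 56/111, each just under half a percentage point away from 0.5.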
That is frustrating. Somehow the maximum likelihood estimator seems reckless. We have almost no data, and it comes up with these really extreme guesses of what the underlying probability is. I would submit that this is unintuitive. If you and I flip a coin and it comes up heads, we would not say that obviously this coin always comes up heads. We would say maybe heads comes up a little bit more frequently, but let's flip it more often. We're not yet convinced that tails never comes up, the way the estimator is in this case.
There is a solution. The solution is to fake it, or more precisely, to add fake data. So what I'll ask you to do is to take the original data for the coin and add two fake data points: one heads and one tails. So if my data was a single observation of heads, give me first the maximum likelihood estimate. In the second box, give me the estimate with the fake data points added: on this augmented data, apply the maximum likelihood estimator. So this is the one without fake data, and this is the one with the fake data.
As before, without fake data it'd be 1, but with it it'd be 2/3, or approximately 0.667. The reason is that without fake data, one out of one coin flips gave me heads, but with the fake data we had three experiments, of which two came up heads, so we get 2/3, or 0.667.
The same for the data 001, please. Without fake data, with fake data.
Without fake data it'll be 0.33, because 1 out of 3 experiments gave me heads, but when I add the fake data, we get 2/5 or 0.4.
Let's take a case where both events are equally likely. Give me the number for the original data set and for the data set with fake data.
It happens to be 0.5 in both cases. In 2 out of 4 experiments, we saw 1. With fake data, it's 3/6, which is still a half.
Just for the fun of it, let's assume we don't have heads just once, but heads twice. What's going to happen for the original data? What do you get with the fake data?
For the original data, it's once again 1: 2 out of 2 experiments ended up heads. But with the fake data, 3 out of 4 experiments come up heads, and that's 0.75.
There are a couple of things to observe here. One is that in general the fake data pulls everything towards 0.5. Where the maximum likelihood estimator goes to extremes, the new estimator is less extreme. 0.33 is further away from 0.5 than 0.4 is. So all these numbers get moved towards 0.5. This is somewhat smoother. We also see that these two outcomes, the first and the last, give the same extreme estimate under the original model, but with our new estimator, the more data we get, the more we are willing to move away from 0.5. One observation of heads gave us 0.667, two of them 0.75. I can promise you that in the limit, as you see only heads infinitely many times, we will finally approach 1. Now, this is really cool. We added fake data, and I will tell you that I generally think these are better estimates in practice. The reason is that it's really reckless to assume after a single coin flip that all coin flips come up heads. I think it's much more moderate to say, well, we already have some evidence that heads might be more likely, but we're not quite convinced yet. The "not quite convinced" is the same as having a prior. There's an entire literature that talks about these priors. They have a very cryptic name: they're called Dirichlet priors. But, more importantly, the method of adding fake data is called a Laplacian estimator. When there is plenty of data, the Laplacian estimator gives about the same results as the maximum likelihood estimator. But when data is scarce, it usually works much, much, much better than the maximum likelihood estimator. It's a really important lesson in statistics.
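The four coin quizzes and the promised limit can be reproduced with a short sketch. For a coin, adding one fake heads and one fake tails turns the estimate heads/n into (heads + 1)/(n + 2). The helper names `mle` and `laplace` are illustrative, not from the lecture.

```python
from fractions import Fraction

def mle(heads, n):
    """Plain empirical frequency."""
    return Fraction(heads, n)

def laplace(heads, n):
    """Add one fake heads and one fake tails, then take the empirical frequency."""
    return Fraction(heads + 1, n + 2)

# (heads, flips) for the four quizzes: "1", "001", 2 heads in 4 flips, "11".
for heads, n in [(1, 1), (1, 3), (2, 4), (2, 2)]:
    print(heads, n, mle(heads, n), laplace(heads, n))

# Only heads, more and more of them: the estimate creeps towards 1.
for n in [1, 2, 10, 100, 1000]:
    print(n, float(laplace(n, n)))
```

The first loop prints the pairs 1 and 2/3, 1/3 and 2/5, 1/2 and 1/2, 1 and 3/4, matching the quiz answers. The second loop shows (n + 1)/(n + 2) getting closer to 1 without ever reaching it.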
We study our die again. You observe the following sequence: 1, 2, 3, 2. I'll ask you now for the two different estimates: what is the maximum likelihood estimate and what is the Laplacian estimate of the probability of the outcome 1 for this die, based on that data?
Let's start with the MLE. Obviously, it's 1 out of 4, or 0.25. Now, let's move to the Laplace estimator. The way I defined the fake data is that I added one data point for each of the 6 possible outcomes of the die. The answer should be 0.2. With the 6 fake data points (1, 2, 3, 4, 5, 6) we have 10 data points in total and get to observe the outcome 1 twice, and that gives us 2 over 10, or 0.2.
Please do the same for the probability of 2, and please give me both numbers.
2 shows up twice in the original data, out of 4 rolls, which makes 0.5. In the new data, we're observing it 3 times out of 10, which is 0.3.
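Both die answers follow the same pattern: with N observations and k possible outcomes, the estimate with one fake data point per outcome is (count + 1) / (N + k). A minimal sketch for the sequence 1, 2, 3, 2, with hypothetical helper names `mle` and `laplace`:

```python
from collections import Counter
from fractions import Fraction

def mle(data, outcomes):
    """Empirical frequency of each possible outcome."""
    counts = Counter(data)
    return {x: Fraction(counts[x], len(data)) for x in outcomes}

def laplace(data, outcomes):
    """One fake observation per possible outcome: (count + 1) / (N + k)."""
    counts = Counter(data)
    k = len(outcomes)
    return {x: Fraction(counts[x] + 1, len(data) + k) for x in outcomes}

faces = [1, 2, 3, 4, 5, 6]
rolls = [1, 2, 3, 2]

print(mle(rolls, faces)[1], laplace(rolls, faces)[1])  # 1/4 1/5
print(mle(rolls, faces)[2], laplace(rolls, faces)[2])  # 1/2 3/10
```

Note that faces 4, 5 and 6, which never appeared, get probability 0 under the MLE but 1/10 each under the estimator with fake data.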
So, in summary, we talked about the maximum likelihood estimator, and we even derived it mathematically for those of you who had the patience to stay on. It's a really simple formula. It's what's called the empirical count. We talked about the Laplace estimator, which adds k fake data points, 1 for each possible outcome, and that resulted in a slightly more complex estimate, as shown in this formula over here. And we identified cases where the Laplace estimator is much better than the maximum likelihood estimator, specifically when there isn't much data. When there isn't much data, fake it by adding more data. I'll see you in class in the next unit.
So we learned something really cool, and we ask ourselves: shall we fake or not fake our data? I'm not so sure. When the data is scarce, I believe faking it gives you a better result. It is unreasonable to assume that after a single coin flip you know the exact probability and that it's either 0 or 1. It's much more reasonable to say we don't know yet and take a more cautious 2/3 or 1/3 as the probability. So obviously I've shown you that even with very, very, very simple methods, faking it might be justified and in fact give you better results. That blows my mind.