**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**

Contents

- 1 Problem Set 4: Outliers and Normal Distribution
- 1.1 01 Quartiles
- 1.2 02 Quartiles Solution
- 1.3 03 Survey
- 1.4 04 Survey Solution
- 1.5 05 Random Walk 1
- 1.6 06 Random Walk 1 Solution
- 1.7 07 Random Walk 2
- 1.8 08 Random Walk 2 Solution
- 1.9 09 Random Walk 3
- 1.10 10 Random Walk 3 Solution
- 1.11 11 Random Walk 4
- 1.12 12 Random Walk 4 Solution
- 1.13 13 Random Walk 5
- 1.14 14 Random Walk 5 Solution
- 1.15 15 Random Walk 6
- 1.16 16 Random Walk 6 Solution

In the unit we learned about quartiles. Here we have a set of 8 numbers. I'd like you to tell me between which numbers the quartile boundaries occur. That is, on one side of the boundary the number is in one quartile (say, the first quartile) and on the other side it's in the second. Please check each box at which there is a quartile boundary.

We have 8 elements: one, two, three, four, five, six, seven, eight. Since there are four quartiles, each one has exactly two elements. Counting two elements at a time, the quartile boundaries fall after the second, fourth, and sixth elements.
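
The counting argument can be sketched in Python. This is only an illustration: the eight numbers from the video are not recorded in these notes, so `sample` below is hypothetical, and `quartile_boundaries` is a helper name invented for this sketch. It assumes the number of elements is divisible by 4.

```python
def quartile_boundaries(values):
    """Return the pairs of neighbouring values between which a quartile
    boundary falls. Assumes len(values) is divisible by 4, so that every
    quartile holds the same number of elements."""
    data = sorted(values)
    n = len(data)
    if n % 4 != 0:
        raise ValueError("this sketch only handles lengths divisible by 4")
    size = n // 4  # elements per quartile
    # A boundary sits after every `size` elements, except at the very end.
    return [(data[i - 1], data[i]) for i in range(size, n, size)]


# Hypothetical data standing in for the eight numbers shown in the video.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
boundaries = quartile_boundaries(sample)
print(boundaries)  # [(2, 3), (4, 5), (6, 7)]
```

With 8 elements each quartile holds 2 of them, so the three boundaries fall after the 2nd, 4th, and 6th elements.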

Consider a website with 100,000 users. A thousand of those are highly active, and you want to take a survey. To simplify this, let's assume that you draw the users at random and that everyone responds. If you send out 10 surveys, I'd like you to tell me the probability that the number of highly active users in our sample is greater than or equal to 2. That is, we have at least two of our highly active users in our sample of 10 surveys. Remember that sometimes it may be easier to compute the probability of something being less than a given number; you can then take 1 minus that to get the probability you want, since the probabilities always have to add up to 1. Enter your answer here.

*Note:*

For simplicity's sake, assume that the proportion of active and non-active users doesn't change as you sample from the list. That is, even though the list gets smaller after each draw (first 100,000 - 1 users, then 100,000 - 2, then 100,000 - 3, ..., down to 100,000 - 10), treat the proportion as always 1,000 active users and 99,000 non-active users per 100,000 users.

To answer this we're going to use this trick, which means we need the probability that the count is less than 2. Let's call the number of highly active users in our sample x to simplify our notation. So we need p(x < 2), which is equal to p(x = 0) + p(x = 1). Each of these is a binomial probability. Recall that we first need the probability of a single draw being a success, that is, of getting a highly active user: p = 1000/100,000, which is 0.01. The probability of x = 0 is 0.99¹⁰; since this can only be arranged one way, we don't even need to know about binomials. The probability of x = 1 is 10!/(9! 1!) × 0.99⁹ × 0.01¹. If we total these, we get approximately 0.995734. Recall that we wanted the opposite event, so we take 1 minus this, which gives our answer: approximately 0.004266.
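
As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation. The helper name `binomial_pmf` is made up for this example; `math.comb` is the standard-library binomial coefficient (Python 3.8+).

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10              # surveys sent out
p = 1000 / 100_000  # chance that one sampled user is highly active

# p(x < 2) = p(x = 0) + p(x = 1), then take the complement.
p_less_than_2 = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)
p_at_least_2 = 1 - p_less_than_2

print(round(p_less_than_2, 6))  # 0.995734
print(round(p_at_least_2, 6))   # 0.004266
```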

Now, I'd like to ask you about random walks. These are really interesting things that show up in everything from finance to physics, but amazingly they're the exact same thing you learned about when you learned how to manipulate normal distributions. Let's assume we have some object (it could be me, it could be a point, it could be an insect, it could be a particle in a fluid) sitting right here at position 0. Now, let's assume that it moves every second in a normally distributed way. Its movement has a mean of 0 and a standard deviation of 1. Note that in this case the variance and the standard deviation are the same, since they're both 1. What I'd like to ask you is: after 1 second, what is the distribution of the object's position? Please tell me the object's mean position and the variance of that position.

Since on average it doesn't move, we just add the mean movement of 0 to its starting position of 0, and we get a mean of 0. Its position has a variance of 1 (and so also a standard deviation of 1).

Now could you tell me what is the mean and variance of its position after 2 seconds?

Its mean is still at 0, since it's on average not moving. To get its variance, we have to add up two independent normally distributed random variables. Recall that variances add, so the answer is simply 1 + 1 = 2.

Now let me ask you a somewhat more challenging question. What is the distribution of the object's position after 10 seconds?

The mean is still at 0, but for the variance we now need to add up 10 individual variances, one for each second. Since all of these are equal to 1, adding them is the same as 10 × 1, which just equals 10. So the position is normally distributed with mean 0 and variance 10.
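
One way to convince yourself that the variances add is to simulate many walks and measure the spread of the final positions. The sketch below is not part of the lecture; the function name, the number of walks, and the seed are arbitrary choices for illustration, and the printed values will only be close to the exact answers (mean 0, variance 1, 2, and 10).

```python
import random
import statistics


def simulate_positions(steps, mu=0.0, sigma=1.0, walks=20_000, seed=0):
    """Final positions of `walks` independent random walks, each made of
    `steps` normally distributed moves with mean mu and std dev sigma."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [
        sum(rng.gauss(mu, sigma) for _ in range(steps))
        for _ in range(walks)
    ]


for t in (1, 2, 10):
    positions = simulate_positions(t)
    mean = statistics.fmean(positions)
    variance = statistics.pvariance(positions)
    # Expect a mean near 0 and a variance near t.
    print(t, round(mean, 2), round(variance, 2))
```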

Now that you see how this works when µ = 0, I'd like to change it. When µ is not equal to 0 we call this a random walk with drift, since it tends to drift in one direction or the other. In this case, we're going to have it drift to the right; specifically, we'll say that each move now has a mean of 1 and a variance of 2. I apologize in advance; I'm sure my drawing isn't quite to scale. Now, to really test your understanding, I'd like you to tell me, at 1 second and 10 seconds, what are the mean and variance of the object's position?

At t = 1 it is fairly straightforward. It moves on average by 1 to the right, so the mean is going to be 1. The variance is going to be 2. At t = 2, it has moved 1 more to the right on average, so the mean is 2, and the variance is going to be 4. At t = 10 both simply scale with time: the mean will be 10, and the variance will be 20.
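
The rule used here is that after t independent steps the mean is t times the step mean and the variance is t times the step variance. A minimal sketch, with `position_distribution` as a helper name invented for this example:

```python
def position_distribution(t, step_mean, step_variance):
    """Mean and variance of the position after t independent steps,
    starting from position 0. Both add up across steps."""
    return t * step_mean, t * step_variance


for t in (1, 2, 10):
    mean, variance = position_distribution(t, step_mean=1, step_variance=2)
    print(t, mean, variance)
# 1 1 2
# 2 2 4
# 10 10 20
```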

Now I'd like to ask you one more question about these random walks. What is the standard deviation of its position at each point in time? Please fill in your answers here.

This is fairly straightforward. We just take the square root of each of the variances. The square root of 2 is approximately 1.414. The square root of 4 is simply 2. The square root of 20 is approximately 4.472. What's interesting about this is that the standard deviation grows more slowly than the mean or the variance. It doesn't grow linearly with time; it grows with the square root of time.
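
The square-root growth is easy to see numerically. In this sketch (the helper name `position_std` is an assumption of the example, and the step variance of 2 comes from the drifting walk), quadrupling the time only doubles the standard deviation:

```python
from math import sqrt


def position_std(t, step_variance=2):
    """Standard deviation of the position after t independent steps."""
    return sqrt(t * step_variance)


for t in (1, 2, 10):
    print(t, round(position_std(t), 3))
# 1 1.414
# 2 2.0
# 10 4.472

# Four times as many steps gives only twice the spread.
ratio = position_std(40) / position_std(10)
print(round(ratio, 3))  # 2.0
```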

Now let's divide this line into several regions. What I'd like you to tell me is in which of these regions our object could be at t = 2. Here are boxes for each of these regions. Check all that apply.

The answer is that it can be in any of these regions. All of the boxes should be checked. To see this, remember that the probability at each point is given by the normal distribution formula, and the key thing to remember about this formula from the lecture is that it becomes very close to 0 in both directions. So the probability of the object being far out on either side is very small, but it's not 0, so the object could be there.
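
To make "very small but not 0" concrete, the sketch below computes the probability of landing in each of several regions at t = 2. Two things are assumptions here: that the object is the drifting walk from the previous questions (so the position has mean 2 and variance 4), and the region edges, which are hypothetical because the regions drawn in the video are not recorded in these notes. `NormalDist` is from Python's standard `statistics` module (3.8+).

```python
from statistics import NormalDist

# Assumption: drifting walk at t = 2, so mean 2 and standard deviation 2
# (variance 4).
position = NormalDist(mu=2, sigma=2)

# Hypothetical region edges; the real regions from the video are unknown.
edges = [float("-inf"), -4, 0, 4, 8, float("inf")]


def region_probabilities(dist, edges):
    """Probability that dist falls between each pair of neighbouring edges."""
    return [dist.cdf(hi) - dist.cdf(lo) for lo, hi in zip(edges, edges[1:])]


probs = region_probabilities(position, edges)
for lo, hi, p in zip(edges, edges[1:], probs):
    print(f"({lo}, {hi}): {p:.5f}")

# Every region has a strictly positive probability.
print(all(p > 0 for p in probs))  # True
```

The outer regions come out tiny compared with the middle ones, but none of them is exactly zero, which is why every box should be checked.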