
Problem Set 4: Outliers and Normal Distribution




01 Quartiles

In the unit we learned about quartiles. Here we have a set of 8 numbers. I'd like you to tell me between which numbers the quartile boundaries occur. That is, on this side of the boundary the number is in one quartile--say the first quartile--and on this side it's in the second. Please check each box at which there is a quartile boundary.

unnamed.jpg

02 Quartiles Solution

We have 8 elements--one, two, three, four, five, six, seven, eight. Since there are four quartiles, each one has exactly two elements. So the quartile boundaries fall after every two elements: after the second, the fourth, and the sixth number.
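The counting argument above can be sketched in a few lines of Python. This is only an illustration: the function name `quartile_boundaries` is made up for this note, and the sketch assumes the number of elements is divisible by 4 so that every quartile holds the same number of elements.

```python
def quartile_boundaries(n):
    """Return how many elements lie to the left of each quartile boundary.

    Hypothetical helper for illustration; assumes n is divisible by 4
    so that each quartile holds exactly n // 4 elements.
    """
    if n % 4 != 0:
        raise ValueError("this sketch only handles n divisible by 4")
    size = n // 4  # elements per quartile
    return [size * k for k in (1, 2, 3)]


# For 8 elements, the boundaries fall after the 2nd, 4th, and 6th element.
print(quartile_boundaries(8))  # [2, 4, 6]
```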

03 Survey

Consider a website with 100,000 users. A thousand of those are highly active, and you want to take a survey. Now, to simplify this, let's assume that you draw the users at random and that everyone responds. If you send out 10 surveys, I'd like you to tell me the probability that the number of highly active users in our sample is greater than or equal to 2. That is, we have at least two of our highly actives in our sample of 10 surveys. Now, remember that sometimes it may be easier to compute the probability of something being less than a given number, and then take 1 minus that. This always gives the same result, since the probabilities have to add up to 1. Enter your answer here.

unnamed (1).jpg


Note:
For simplicity's sake, assume that the proportion of highly active users doesn't change as you sample from the list. That is, even though the list gets smaller after each draw (first 100,000 - 1 users, then 100,000 - 2, then 100,000 - 3, ... and finally 100,000 - 10), treat the probability of drawing a highly active user as 1,000 out of 100,000 on every draw.


04 Survey Solution

To answer this we're going to use this trick, and that means we need the probability of the complement. Let's call the number of highly actives in our sample x to simplify our notation. So we need p(x < 2), and that is going to be equal to p(x = 0) + p(x = 1). Here x is a binomial variable. Recall that what we need to figure out is the probability of a single draw being a success--the probability of getting a highly active user--which is p = 1000/100,000 = 0.01. The probability of x = 0--since this can only be arranged one way, we don't even need to know about binomials--is 0.99¹⁰. The probability of x = 1 is going to be 10!/(9! 1!) × 0.99⁹ × 0.01¹. So, if we total this all together, we get approximately 0.995734. Then recall from the formula that we wanted the opposite thing, so we just take 1 minus this, and that is our answer: p(x ≥ 2) ≈ 0.004266.
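The arithmetic in this solution can be checked with a short Python sketch. The helper name `binomial_pmf` is made up for this note, and the sketch relies on the same simplifying assumption as the problem: every draw has the same success probability of 0.01.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws,
    each succeeding with probability p (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10                # surveys sent out
p = 1000 / 100_000    # chance that a single draw is a highly active user

# Complement trick: P(x >= 2) = 1 - [P(x = 0) + P(x = 1)]
p_less_than_2 = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)
p_at_least_2 = 1 - p_less_than_2

print(round(p_less_than_2, 6))  # 0.995734
print(round(p_at_least_2, 6))   # 0.004266
```

Computing the two small cases and subtracting from 1 avoids summing the nine terms from x = 2 through x = 10 directly.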

05 Random Walk 1

Now, I'd like to ask you about random walks. These are really interesting things that show up in everything from finance to physics, but amazingly they're the exact same thing you learned about when you learned how to manipulate normal distributions. Let's assume we have some object--it could be me, it could be a point, it could be an insect, it could be a particle in a fluid--sitting right here at position 0. Now, let's assume that it moves every second in a normally distributed way. Its movement has a mean of 0 and a standard deviation of 1. Note that in this case the variance and the standard deviation are actually the same, since they're both 1. What I'd like to ask you is what the distribution of the object's position is after 1 second. Please tell me the object's mean position and the variance of that position.

unnamed (2).jpg

06 Random Walk 1 Solution

Since on average it didn't move, we just add the mean of its starting position to the mean movement of 0, and we get a mean of 0. Its position has a variance of 1, which here is also its standard deviation.

07 Random Walk 2

Now could you tell me what the mean and variance of its position are after 2 seconds?

unnamed (3).jpg

08 Random Walk 2 Solution

Its mean is still at 0, since it's on average not moving. To get its variance, we have to add up two normally distributed random variables. Recall that the variances of independent variables add, so the answer is simply 1 + 1 = 2.

09 Random Walk 3

Now let me ask you a somewhat more challenging question. What is the distribution of the object's position after 10 seconds?

unnamed (4).jpg

10 Random Walk 3 Solution

The mean is still at 0, but for the variance we now need to add up 10 individual variances. There is one for each time step. Since all of these are equal to 1, adding them is the same as saying 10 × 1, which just equals 10. So the position is normally distributed with a mean of 0 and a variance of 10.
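One way to convince yourself that the variances really do add is to simulate many walks and look at where they end up. The following is a minimal sketch, not part of the lecture: the function name, the number of walks, and the seed are all arbitrary choices made for this note, and the steps are assumed to be independent.

```python
import random
from statistics import fmean, pvariance


def simulate_final_positions(n_walks, n_steps, step_mean=0.0, step_std=1.0, seed=0):
    """Simulate n_walks independent random walks starting at 0 and
    return the final position of each (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    positions = []
    for _ in range(n_walks):
        position = 0.0
        for _ in range(n_steps):
            # One normally distributed step per second.
            position += rng.gauss(step_mean, step_std)
        positions.append(position)
    return positions


positions = simulate_final_positions(n_walks=20_000, n_steps=10)
sample_mean = fmean(positions)
sample_variance = pvariance(positions)

print(f"sample mean     = {sample_mean:.3f}  (theory: 0)")
print(f"sample variance = {sample_variance:.3f}  (theory: 10)")
```

The simulated values will not match the theory exactly, but with this many walks they should land close to 0 and 10.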

11 Random Walk 4

Now that you see how this works when µ = 0, I'd like to change it. When µ is not equal to 0 we call this a random walk with drift, since it tends to drift in one direction or the other. In this case, we're going to have it drift to the right. Specifically, we'll say that each step now has a mean of 1 and a variance of 2. I apologize in advance. I'm sure my drawing isn't quite to scale. Now, to really test your understanding, I'd like you to tell me, at 1 second and at 10 seconds, what are the mean and variance of the object's position?

unnamed (5).jpg

12 Random Walk 4 Solution

At t = 1 it is fairly straightforward. It moves on average by 1 to the right, so the mean is going to be 1. The variance is going to be 2. At t = 2, it's now moved 1 more to the right on average, so the mean is 2, and the variance is going to be 4. At t = 10 both are simply multiplied by the number of steps. The mean will be 10, and the variance will be 20.
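Because both the means and the variances of the steps add, the position after t steps has t times the step mean and t times the step variance. A small sketch of that rule, with a function name invented for this note and the steps assumed independent:

```python
def position_mean_and_variance(t, step_mean, step_variance):
    """Mean and variance of the position after t independent steps
    of a walk that starts at 0 (hypothetical helper)."""
    return t * step_mean, t * step_variance


# Random walk with drift: each step has mean 1 and variance 2.
for t in (1, 2, 10):
    mean, variance = position_mean_and_variance(t, step_mean=1, step_variance=2)
    print(f"t = {t:>2}: mean = {mean}, variance = {variance}")
```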

13 Random Walk 5

Now I'd like to ask you one more question about these random walks. What is the standard deviation of its position at each point in time? Please fill in your answers here.

unnamed (6).jpg

14 Random Walk 5 Solution

This is fairly straightforward. We just take the square roots of each of these variances. The square root of 2 is approximately 1.414. The square root of 4 is simply 2. The square root of 20 is approximately 4.472. What's interesting about this is that these grow more slowly than the mean or the variance does. They don't grow linearly with time but with the square root of time.
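The square-root growth is easy to see numerically. This sketch uses a made-up helper name, `position_std`, and assumes the drifting walk from the previous question, where each independent step has a variance of 2.

```python
from math import sqrt


def position_std(t, step_variance):
    """Standard deviation of the position after t independent steps
    (hypothetical helper): the square root of the summed variance."""
    return sqrt(t * step_variance)


for t in (1, 2, 10):
    print(f"t = {t:>2}: standard deviation = {position_std(t, 2):.3f}")
# t =  1: standard deviation = 1.414
# t =  2: standard deviation = 2.000
# t = 10: standard deviation = 4.472
```

Going from t = 1 to t = 10 multiplies the variance by 10 but the standard deviation only by about 3.16, the square root of 10.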

15 Random Walk 6

Now let's divide this line into several regions. What I'd like you to tell me is in which of these regions could our object be at t = 2. Here are boxes for each of these regions. Check all that apply.

unnamed (7).jpg

16 Random Walk 6 Solution

The answer is that it can be in any of these regions. All of the boxes should be checked. To see this, remember that our probability density at each point is given by this formula, and the key thing to remember about this formula from the lecture is that it becomes very close to 0 in both directions but never reaches it. The probability of the object being here or here is very small, but it's not 0, so it could be here.
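To make this concrete, the normal density can be evaluated at a few points along the line. This is a sketch under assumptions: it takes the drifting walk from the earlier questions, so that the position at t = 2 has a mean of 2 and a variance of 4, and the sample points are arbitrary values chosen for this note rather than the region boundaries from the drawing.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)


# Assumed distribution of the position at t = 2: mean 2, variance 4.
mean_at_2, variance_at_2 = 2.0, 4.0

# Arbitrary points spread along the line, some far out in the tails.
sample_points = [-10, -4, 0, 2, 8, 14]
densities = {x: normal_pdf(x, mean_at_2, variance_at_2) for x in sample_points}

for x, density in densities.items():
    print(f"x = {x:>3}: density = {density:.3e}")
```

Every value printed is greater than 0, although the ones far from the mean are tiny. Far enough out, floating-point arithmetic will round the density down to 0, even though the mathematical value stays positive.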