
16.  Programming Estimators (Optional)


These are draft notes extracted from subtitles.

01 Mean

So this is an optional unit. I want you to program what you just learned in the previous units, and to me, this is an enormous amount of fun. It's challenging, especially if you don't have programming experience, but it's also the moment where you can really exercise and deeply understand some of the very basic concepts we talked about before. Of course, these concepts weren't particularly hard. What's really important is that this is optional. It is not required for completing the class. This is really just a fun exercise, so feel free to go to the next unit if you don't want to program.

In the first exercise, we will calculate the mean of data. So we'll define what "mean" means: we write def mean, and the mean is computed of something, namely data. Then there's this funny notion of return, where you put the mathematical expression for the mean. When you want to print, say, the mean of a specific data set (let's call it data1), you would write print mean(data1).

Now this is a little bit more complicated than the kind of instructions we did before. I'm actually defining what's called a function, and the reason I do this is that it allows us to test your function with different data examples to make sure it's really correct. But the key thing is that you have to return the correct thing.

I'll give you a hint: in Python, there are special commands. One is called "sum." Sum applies to lists like this one. It gives you the sum of all the elements. In this case, sum(data) should give you the sum of those numbers, which will be 2 if you add them all up. The other convenient function that is part of Python, and that you should just know exists, is called "len," short for length. How long is this thing? This thing here has 5 elements: one, two, three, four, five. So that'll give you 5. So let's dive in.

Here's our programming environment with a data sequence. I'm setting up the mean function right over here. You are to return something; this is where you put your code. And then for testing, I just run this function and print what it returns. That's the syntax.

So let me give you an example. Suppose you put a fixed value in here, like 12 in this case. Now we hit the run button. Then the output would be 12, which is not the correct answer, but you can play with that. If you say return sum(data), which is the command I've just given you, and hit the run button, then for this specific data sequence you get 816.0 as the answer. Now the job is yours to plug in the right answer here.

#Complete the mean function to make it return the mean of a list of numbers

data1=[49., 66, 24, 98, 37, 64, 98, 27, 56, 93, 68, 78, 22, 25, 11]

def mean(data):
    #Insert your code here

print mean(data1)
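
The two built-ins named in the hint can be tried on their own before writing the function. This is only a minimal sketch: the list `sample` is made up for illustration and is not from the lecture, and `print(...)` is written with parentheses so the snippet also runs under Python 3.

```python
# 'sample' is a hypothetical list used only for illustration.
sample = [3, 1, 4, 1, 5]

total = sum(sample)   # adds up all the elements: 3 + 1 + 4 + 1 + 5
count = len(sample)   # number of elements in the list

print(total)  # 14
print(count)  # 5
```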

02 Mean Solution

And of course, the correct answer for the mean is to sum up all the data and divide it by the number of data items. So you use sum(data) divided by len(data); realize that inside the function this thing is called data. Then if we plug data1, the specific data sequence, into the function and hit the run button, we get 54.4 as the answer. That's the correct mean.

So if you got this right, you wrote an interesting piece of software already. Now if you have lots of programming experience, this was trivial, but if you're new to programming, it's actually quite remarkable.

#Complete the mean function to make it return the mean of a list of numbers

data1=[49., 66, 24, 98, 37, 64, 98, 27, 56, 93, 68, 78, 22, 25, 11]

def mean(data):
    #Insert your code here
    return sum(data)/len(data)

print mean(data1)
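
One caveat worth knowing: under Python 2, which the print statements in these notes suggest, dividing one integer by another discards the fractional part. The listing above avoids this because the first entry is written as 49., a float; reading the trailing dot that way is an assumption, not something the lecture states. A hedged sketch of a variant that converts the length explicitly (the names `mean_float` and `ints` are made up for illustration):

```python
def mean_float(data):
    # float(...) forces true division even when every item is an integer
    return sum(data) / float(len(data))

ints = [1, 2, 2]          # hypothetical all-integer data
print(mean_float(ints))   # a value between 1 and 2, not a truncated 1
```

Under Python 3 the plain `/` already performs true division, so the conversion is harmless but unnecessary there.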

03 Median

Slightly more challenging to program is the function median. The median of this list is the middle element of the sorted list, which in this case will be 2. Of course, if there is an even number of elements (let me just add one), the median isn't exactly defined. So let's say I pick either one of the two center ones. All the examples I'm testing with will have an odd number of elements. Okay? So let's not worry about the case where the list is of even length. Let's just make sure our code runs for an odd length.

This one is more complicated to program, and there are two hints. First, there is a function called "sorted." You give it data, and the output of the function is a sorted list. That's built into Python, so luckily we don't have to worry about how to sort things. You assign the result to a new list, and the way you do this is you just give it a name. Say sdata = sorted(data). That gives you a sorted list.

The second thing to know is that if you want to access any element in a structure like this, you can do so using square-bracket notation, for example data[3]. Now this is the tricky thing. This doesn't give you the third element in the list. It gives you the fourth, and the reason is that each list is indexed starting with zero. So this list has five elements, and the indices go from zero to four. To get the center element, you would have to use index number two, and that gives you effectively the third element in the list.

I apologize for this. It is all over computer science that the indices tend to run from zero on, not from one on, whereas in the English language we use one as the first index. Some programming languages use one, some use zero. Python uses zero, so you have to know this.

With those hints, you should be able to fill in the gaps, write your code, and return something. If you print the median of this data sequence, then it should output 2, which is the median.

So here is the code. To solve this, first make a new sorted list of the data and then find the corresponding element in the list to return. Let's assume there is an odd number of items in the list, so things should always be fine.

#Complete the median function to make it return the median of a list of numbers
data1=[1,2,5,10,-20]
def median(data):
    #Insert your code here

print median(data1)
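
The two hints for this exercise, `sorted` and zero-based indexing, can be tried separately first. A minimal sketch, assuming a made-up list `values` that is not from the lecture, with `print(...)` in parentheses so it also runs under Python 3:

```python
# 'values' is a hypothetical list used only for illustration.
values = [30, 10, 20]

ordered = sorted(values)  # returns a new sorted list; 'values' is unchanged

print(ordered)     # [10, 20, 30]
print(values)      # [30, 10, 20]
print(ordered[0])  # 10, the first element (index 0)
print(ordered[2])  # 30, the third element (index 2)
```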

04 Median Solution

So here is my solution. I created a new data structure called sdata, and I get it by applying the sorted command to data. That's Python notation. If you've never programmed before, read the notation like this: we take this thing called data, we run it through the sorter, out comes something new, and we assign it to the left side, to this new thing called sdata. I can just make these names up.

And then I ask myself, "What's the right index?" If sdata is of length 5, the index I want is not number 3 but number 2, because the indexing starts at zero. So say len(data) returns 5. I subtract 1. It gives me 4. And I divide the 4 by 2. That gives me index number 2, and this works for any odd data length. If I had 7, for example, 7 minus 1 is 6. Divided by 2 is 3. It always gives me the middle element.

And then, with this indexing, we just return the center element: the element of the sorted list that's right in the middle. That returns the median. So if I hit the run button, I get back 2.

#Complete the median function to make it return the median of a list of numbers
data1=[1,2,5,10,-20]
def median(data):
    #Insert your code here
    sdata = sorted(data)
    #Integer division so the result can be used as a list index
    index = (len(data) - 1) // 2
    return sdata[index]

print median(data1)
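
The index arithmetic from the explanation can be checked in isolation. This is a small sketch with a made-up helper name, `middle_index`; it uses `//`, which is integer division in both Python 2 and Python 3.

```python
def middle_index(n):
    # (n - 1) // 2 is the zero-based index of the center of an odd-length list
    return (n - 1) // 2

# try a few odd lengths, including the 5 and 7 used in the explanation
for n in [1, 3, 5, 7]:
    print(middle_index(n))
```

Lengths 5 and 7 give 2 and 3, matching the worked numbers above. For an even length the same formula picks the lower of the two center positions, which fits the "either one" convention used in this unit.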

05 Mode

Now things should become challenging. I ask you to program the mode of a data set. Of course, data sets can have multiple modes. I just extended mine to contain the element 5 three times. Let's assume for now that if you have multiple modes, you can return any one of them; I don't care. But if the mode is unique, I want the correct mode. So this is entirely nontrivial for programming, and it's a real challenge. This is what makes this class fun. When you print the mode of this data set, I want it to return 5.

Now some hints. There are many different ways to implement this using complex data structures such as sets, but at the minimum, for a simple solution, you should know what a for loop is and what an if statement is, and then there's a beautiful function called "data.count." You give it an argument, in this case the number 5, and it returns how often this specific number occurs in the data: three times for the number 5. If you were to give it 6 as an argument, which doesn't occur at all, the result would be zero.

My solution has a funny "for" notation: for variable "i" in the range of the length of data. Len(data), we know, is the number of elements. There are 7 here, so it would give me 7. Range turns this into a list of indices into the data, from zero to six, and then the for loop goes through these indices. That's one way to access each element in the data set sequentially.

Now the count function allows me to count how often each element occurs. So if you take the i'th data item, data[i], then data.count(data[i]) tells you how often that specific number, the i'th number in the original list, occurs in the data.

I'll leave it at this. This is not a trivial question. If you get stuck, you might go to the web and read up on for, on count, and on if statements. So here is the code, and good luck programming it.

#Complete the mode function to make it return the mode of a list of numbers
data1=[1,2,5,10,-20,5,5]
def mode(data):
    #Insert your code here

print mode(data1)
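
The pieces named in the hints (`count`, `range`, and a for loop over indices) can be tried on a small made-up list before tackling the exercise. A minimal sketch; the list `items` is hypothetical, and `print(...)` is written with parentheses so it also runs under Python 3:

```python
# 'items' is a hypothetical list used only for illustration.
items = [7, 8, 7]

print(items.count(7))  # 2, because 7 appears twice
print(items.count(9))  # 0, because 9 does not appear

# range(len(items)) yields the indices 0, 1, 2
for i in range(len(items)):
    # items[i] is the i'th element; count tells how often it occurs
    print(items.count(items[i]))
```

The loop prints 2, 1, 2: one count per position in the list.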

06 Mode Solution

My solution, I admit, is not the most elegant, but it's the simplest that I could find without going into more complicated data structures. The key thing is that I'm going to go through all my data items, so there's a for loop. As I explained before, len(data) gets me the length of the data sequence, and range gives me a list of all the indices, from zero to six in this case. We're going to go through these indices one after another in the variable "i."

Now comes the tricky part. I pick the i'th data item, and I count how often it occurs in the entire list. For the first data item, 1, the argument would be 1, and count will give me 1 because there's exactly one occurrence of 1. But for the third item ("i" will be 2 now, counting from zero), data[i] gives me the number 5. I call count and get back 3, because there are three occurrences of 5.

Now I need to find the maximum count, and specifically the data item that maximizes the count. For that, I've introduced a variable called "modecnt" that I've set to zero. If my current count exceeds modecnt, then I've found a new winner. So I set the mode to this new data item, and I update modecnt to reflect the fact that this new winner has a higher count in the data set than my initial zero. I iterate this. Out comes the mode. When I run it for this data set, I get 5.

If you got this right, then you know a lot about programming. This was really a nontrivial programming quiz.

#Complete the mode function to make it return the mode of a list of numbers
data1=[1,2,5,10,-20,5,5]
def mode(data):
    #Insert your code here
    modecnt=0
    for i in range(len(data)):
        icount=data.count(data[i])
        #Keep the item whenever its count beats the best count so far
        if icount>modecnt:
            mode=data[i]
            modecnt=icount
    return mode

print mode(data1)
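
The lecture mentions that other implementations are possible using structures such as sets. One such alternative, offered only as a sketch and not as the lecturer's solution, combines `set` with the built-in `max` and its `key` argument. The name `mode_via_max` is made up for illustration.

```python
def mode_via_max(data):
    # set(data) removes duplicates; max picks the value with the largest count
    return max(set(data), key=data.count)

data1 = [1, 2, 5, 10, -20, 5, 5]
print(mode_via_max(data1))  # 5
```

When several values tie for the highest count, this sketch returns one of them without controlling which, which matches the "return either one" rule stated for the exercise.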

07 Variance

Now things will become even more complicated. I'm going to give you a list of numbers that is slightly different from the one before: these are actually floating-point numbers. They have a decimal point. And I want you to implement the function variance. It takes our data and returns a single number, which is the variance of the data. For the data set I will be giving you, it so turns out that the variance is 62.572.

Now some hints. First, you're going to use the function mean that you have already programmed, so it's in your code. Just use it. And then the trick that I want to play is this: we have our list, data, inside the function. We're going to transform it into a new list called "ndata." It is the normalized data, which effectively is the data minus the mean, which I'll call "mu." So you compute the mean, call it "mu," perhaps, and then subtract mu from each data item.

Making this new data set is not entirely trivial. The way I do it is to construct the data set iteratively with a for loop. First I set it to an empty list. Then there's this function called "append" (dot-append), and whatever's inside the function (you've got to figure this out) will be appended to this list. So with an initial assignment of an empty list and a for loop, we go through all the data items and append the appropriate thing to this new data. I get out this new data list. Then I can apply the same mean function to the new data list, and that gives me the variance. You got it? If not, listen to the video again.

So here's the coding environment. Our data set. I've given you the function mean that we've programmed before, and now you are to program the function variance. If you print the variance of the data set, you get the desired 62.572884.

#Complete the variance function to make it return the variance of a list of numbers
data2=[13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]
def mean(data):
    return sum(data)/len(data)
def variance(data):
    #Insert your code here

print variance(data2)
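
The pattern from the hint, starting with an empty list and appending inside a for loop, can be practiced on something unrelated to the variance first. A minimal sketch; the lists `numbers` and `doubled` are made up for illustration, and what to append for the variance is still left to you:

```python
# 'numbers' is a hypothetical list used only for illustration.
numbers = [1, 2, 3]

doubled = []                        # start with an empty list
for i in range(len(numbers)):
    doubled.append(numbers[i] * 2)  # add one new item per loop pass

print(doubled)  # [2, 4, 6]
```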

08 Variance Solution

And here's my solution. First I compute the mean using the mean function and assign it to a new variable that I make up, called mu. So now, inside variance, I know the mean.

Here's how I construct my new list, ndata. I first set it to the empty list. Then I go through every index in the original data structure. Len(data), as we know, gives me the number of elements. Range constructs a list of indices from zero to that number minus one. Then I go through it sequentially, and that allows me to index the i'th item in my data structure. Now I take the i'th item and subtract the mean. The **2 is a very short notation for squaring it; I could just as well have multiplied the difference by itself. This is my new i'th element, which I append to ndata. Since we're doing this with a for loop, I get a parallel data set called ndata that is data minus the mean, squared. Now I compute the mean of that new ndata and return it; that's my variance.

It turns out there's an even more elegant way to do this in Python. It requires a good amount of programming skill, probably way beyond most beginners. Here is how it goes: you can do this in two lines. You compute the mean, as we've done before, with the mean function that we programmed ourselves. And here is the tricky part. We're now going to compute the mean of a new list, and we can construct this new list in a single line; that's the key thing here. We go through all the data items, for x in data, and we construct a new list where the math for each element is x minus mu, squared. So this loops through all the data items, applies this equation to each data item, constructs the new list, passes this list on to the mean function, and returns the correct value. This is one of the compact ways of programming the variance.

#Complete the variance function to make it return the variance of a list of numbers
data2=[13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]
def mean(data):
    return sum(data)/len(data)
def variance(data):
    #Insert your code here
    mu = mean(data)
    ndata = []
    for i in range(len(data)):
        ndata.append((data[i] - mu)**2)
    sigma2 = mean(ndata)
    return sigma2

print variance(data2)
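
The two-line version described at the end of the explanation can be sketched as follows. It mirrors the compact variance used in the Standard Deviation listings in these notes; the name `variance_compact` is made up here so it can sit next to the loop version, and `print(...)` is written with parentheses so the snippet also runs under Python 3.

```python
data2 = [13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]

def mean(data):
    return sum(data) / len(data)

def variance_compact(data):
    mu = mean(data)
    # build the list of squared deviations in one expression, then average it
    return mean([(x - mu) ** 2 for x in data])

print(variance_compact(data2))  # approximately 62.572884
```

Up to floating-point rounding, it should agree with the loop-based variance shown above.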

09 Standard Deviation

So as my last exercise for you, I want you to program the standard deviation. I won't say much other than that you might want to use the function "sqrt," which returns the square root of its argument.

So here's the coding environment. I'm actually importing the function sqrt for you from the math library. Otherwise it doesn't work. Then there's the data structure, mean, and our compact version of variance. I want you to program the standard deviation so that when you print stddev of data2, you get 7.9103 and so on for the data set I'm giving you. So just enter the right thing in stddev.

#Complete the stddev function to make it return the standard deviation 
#of a list of numbers
from math import sqrt

data2=[13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]


def mean(data):
    return sum(data)/len(data)
def variance(data):
    mu=mean(data)
    return mean([(x-mu)**2 for x in data])
def stddev(data):
    #Insert your code here

print stddev(data2)

10 Standard Deviation Solution

And this was an easy one. All we do is take variance(data), pass it into the square root, and return the result as the standard deviation, and this gets us the correct solution.

#Complete the stddev function to make it return the standard deviation 
#of a list of numbers
from math import sqrt

data2=[13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]


def mean(data):
    return sum(data)/len(data)
def variance(data):
    mu=mean(data)
    return mean([(x-mu)**2 for x in data])
def stddev(data):
    #Insert your code here
    return sqrt(variance(data))

print stddev(data2)

11 Congratulations

So thank you, and congratulations for getting this far! Some of these programming exercises were somewhat nontrivial, and if you've never programmed before and got them right, I'm really impressed. So this completes this optional unit. Let's move on to the next unit.