Revision Notes

This is a page for sharing short, helpful notes with fellow students. Thanks to @gabe for getting this started.


3. Scatter plots

Scatter plots work well when data is 2-dimensional. They are not so good for high-dimensional data (e.g. 125 dimensions).

Data is linear when all points lie on a straight line. This is rare. For example, if the price per square foot, a, is the same for all houses, then price y and size x satisfy y=a \cdot x, or more generally:

  • y= a \cdot x+b
  • price = a \cdot size + constant

Outliers are data points which don’t fit easily with the majority of the data.

Noise is when data is randomly distributed around the mean (not good for scatter plots; use bar charts instead).

4. Bar charts

Bar charts are useful for grouping data into intervals, which removes some random noise and shows global trends. A bar chart is a cumulative tool: each bar adds up the data in its interval.

A histogram is a special type of bar chart that examines only one dimension of the data, plotted against frequency on the y-axis, e.g. an age histogram plots how many people fall in the 0-10, 11-20, 21-30 etc. brackets.

Bar charts and histograms both aggregate the data.
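
A rough sketch of the aggregation a histogram does, using the age brackets mentioned above. The `ages` list is made-up sample data and `bracket` is a helper name chosen for this example.

```python
from collections import Counter

# Made-up sample ages, for illustration only.
ages = [3, 10, 11, 15, 20, 21, 22, 29, 30, 34]

def bracket(age):
    """Return the (low, high) age bracket, using 0-10, 11-20, 21-30, ..."""
    index = max(0, (age - 1) // 10)
    low = 0 if index == 0 else index * 10 + 1
    return (low, (index + 1) * 10)

# Frequency of each bracket: this is what the histogram's y-axis shows.
counts = Counter(bracket(age) for age in ages)

# Print a crude text histogram, one '#' per person.
for (low, high), n in sorted(counts.items()):
    print(f"{low:>2}-{high:<2} {'#' * n}")
```

Ten individual ages collapse into four bars, which is the sense in which the chart aggregates the data.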

5. Pie charts

Pie charts are useful to visualize relative data quickly and intuitively (e.g. Party A got 25% of the vote, Party B got 15%, Party C 2% & remainder didn’t vote). Proportions can be seen quickly.

6. Python charts

Optional. You can use the Udacity web interface to plot charts. To make your own Python charts, you need NumPy and Matplotlib installed. Refer to Plotting Graphs with Python for information on that.

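This will plot a histogram when run from the terminal (a plot window opens):

from matplotlib import pyplot
from numpy.random import randn

x = randn(1000)            # 1000 samples from a standard normal distribution
pyplot.hist(x, bins=100)   # group the samples into 100 bins
pyplot.show()              # display the chart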


7. Gender bias

This example shows that statistics is deep and often manipulated. Looking at each major individually, females seem to be favored, but in the aggregate data males seem to be favored (an instance of Simpson's paradox).
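
A small sketch of how such a reversal can happen. The applicant and admission counts below are illustrative numbers only, not data from these notes.

```python
# Illustrative numbers only (not from these notes): (applicants, admitted).
admissions = {
    "major A": {"male": (900, 450), "female": (100, 80)},
    "major B": {"male": (100, 10), "female": (900, 180)},
}

def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

# Per-major admission rates: females do better in both majors.
for major, groups in admissions.items():
    for gender, (applicants, admitted) in groups.items():
        print(major, gender, f"{rate(applicants, admitted):.0%}")

# Aggregate rates: pooling the majors reverses the picture.
totals = {}
for gender in ("male", "female"):
    applicants = sum(groups[gender][0] for groups in admissions.values())
    admitted = sum(groups[gender][1] for groups in admissions.values())
    totals[gender] = rate(applicants, admitted)
    print("overall", gender, f"{totals[gender]:.0%}")
```

The reversal comes from the mix of applications: in this made-up data most females apply to the major with the lower admission rate.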

8. Probability

A fair coin has a 50% chance of coming up heads:

Probability of an event, P

P(heads) = P(tails) = 0.5

Loaded coin example:

  • P(H) = 0.9
  • P(T) = 0.1

P(H) + P(T) = 1 (always true).

Probability of an opposite event, (1-P)

P(A) = 1 - P(\neg A)

Loaded coin with 2 heads: P(H) = 1 and P(T) = 0.

Probability of a composite event, P.P.P …

Where these are independent events.

P(H,H) (two heads in a row) = P(H) \cdot P(H) = 0.25 for a fair coin.

Truth Table

A truth table shows all outcomes: (HH, HT, TH, TT). For a fair coin, each has a \frac{1}{4} chance. The probabilities always sum to 1.

If P(H) = 0.9, then:

  • P(H,H) = 0.81
  • P(H,T) = 0.09
  • P(T,H) = 0.09
  • P(T,T) = 0.01

So P(\text{2 heads in a row}) = 0.81 and P(\text{Only one head}) = 0.09 \times 2 = 0.18.

For 3 flips with loaded coin, P(\text{Only one head}) = (0.9 \times 0.1 \times 0.1) \times 3 = 0.027.
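
The truth-table calculations above can be sketched in a few lines, assuming independent flips. The function names are made up for this example.

```python
from itertools import product
from math import prod

def outcome_table(p_heads, flips):
    """Map each sequence such as ('H', 'T') to its probability (independent flips)."""
    p = {"H": p_heads, "T": 1 - p_heads}
    return {seq: prod(p[side] for side in seq) for seq in product("HT", repeat=flips)}

def p_exactly_one_head(p_heads, flips):
    """Add up the rows of the truth table that contain exactly one head."""
    table = outcome_table(p_heads, flips)
    return sum(prob for seq, prob in table.items() if seq.count("H") == 1)

table = outcome_table(0.9, 2)
print(table[("H", "H")])            # about 0.81
print(sum(table.values()))          # about 1
print(p_exactly_one_head(0.9, 2))   # about 0.18
print(p_exactly_one_head(0.9, 3))   # about 0.027
```

The results are floating-point numbers, so expect tiny rounding differences from the hand-calculated values.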

9. Conditional Probability

Dependent events: event A influences event B, or B is dependent on A (e.g. becoming a professor depends on being smart).

Let's say half of the total population is smart: P(smart) = P(dumb) = 0.5:

  • P(prof) = 0.001

  • P(prof|smart) = 0.002

  • P(prof|dumb) = 0.000

i.e. the probability of being a prof is \frac 1{1000}, the probability of being a prof if smart is \frac 1{500}, and the probability of being a prof if dumb is 0.

Example: cancer test.

Prior probability

  • P(C) = 0.1

  • P(\neg C) = 0.9

i.e., 10% of the general population has cancer and 90% no cancer.



  • P (pos \mid C) = 0.9

  • P (neg \mid C) = 0.1

i.e. if the person has cancer, the test will be positive (correct) in 90% of cases; this is the sensitivity of the test. It will be negative (a false negative, i.e. incorrect) in 10% of cases.



  • P (pos \mid \neg C) = 0.2

  • P (neg \mid \neg C) = 0.8

i.e. if the person does not have cancer, the test will be positive (a false positive, i.e. incorrect) in 20% of cases. It will be negative (correct, all clear) in 80% of cases; the latter is the specificity of the test.


  • Knowing the prior is not always easy in practice.

  • Sensitivity and specificity are distinct numbers. They don't have to be equal, or sum to 1.

Truth table (sum = 1):

  • P (C , pos) = 0.1 \times 0.9 = 0.09

  • P (C , neg) = 0.1 \times 0.1 = 0.01

  • P (\neg C , pos) = 0.9 \times 0.2 = 0.18

  • P (\neg C , neg) = 0.9 \times 0.8 = 0.72
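
A short sketch that rebuilds this truth table from the prior, sensitivity and specificity given above. The variable names are chosen for this example.

```python
p_cancer = 0.1       # prior, P(C)
sensitivity = 0.9    # P(pos | C)
specificity = 0.8    # P(neg | not C)

# Joint probability = prior of the row times the conditional of the test result.
joint = {
    ("C", "pos"): p_cancer * sensitivity,
    ("C", "neg"): p_cancer * (1 - sensitivity),
    ("not C", "pos"): (1 - p_cancer) * (1 - specificity),
    ("not C", "neg"): (1 - p_cancer) * specificity,
}

for row, prob in joint.items():
    print(row, round(prob, 2))

# The four rows cover every possibility, so they should add up to 1.
total = sum(joint.values())
print(round(total, 2))
```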

Total probability

P (pos) = P(pos \mid C) \cdot P(C) + P(pos \mid \neg C) \cdot P(\neg C) = 0.09 + 0.18 = 0.27

Example: two coins (x and y) in a bag (slightly different from the class example). x is fair, so P (H \mid x) = 0.5, and y is loaded, with P(H \mid y)=0.9. There's an equal chance of picking either.

  • P (\text{pick 1 coin, flip it twice and both times tails}) = P (T,T)

To solve, make a truth table with 3 columns (pick, flip, flip) and 8 rows:

  • P (T,T) = P(x,T,T) + P(y,T,T) (so 2 rows match)

  • P (T,T) = (0.5 \times 0.5 \times 0.5) + (0.5 \times 0.1 \times 0.1) = 0.125 + 0.005 = 0.13
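
The 8-row truth table for the two-coin example can be built mechanically. This is only a sketch; the dictionary names are chosen for illustration.

```python
from itertools import product

p_pick = {"x": 0.5, "y": 0.5}     # equal chance of picking either coin
p_heads = {"x": 0.5, "y": 0.9}    # x is fair, y is loaded

# Build the truth table: (coin, first flip, second flip) -> probability.
rows = {}
for coin, flip1, flip2 in product("xy", "HT", "HT"):
    prob = p_pick[coin]
    for flip in (flip1, flip2):
        prob *= p_heads[coin] if flip == "H" else 1 - p_heads[coin]
    rows[(coin, flip1, flip2)] = prob

# Total probability: add the rows that match (T, T), one per coin.
p_tt = sum(prob for (coin, f1, f2), prob in rows.items() if (f1, f2) == ("T", "T"))
print(round(p_tt, 3))
```

Only the rows (x, T, T) and (y, T, T) contribute, giving 0.125 + 0.005.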

10. Bayes Rule

Named after Reverend Thomas Bayes. A very important theorem in probability.

Cancer test example


P(C) = 0.01 … i.e. 1% of the general population has this cancer

Sensitivity of test

P(pos | C) = 0.9 … i.e. for those with cancer: 90% are correctly diagnosed as positive

Specificity of test

P(neg | \neg C) = 0.9 … i.e. for those without cancer: 90% are correctly diagnosed as negative (so the test misdiagnoses 10% as positive)

Joint probability

Draw a Venn diagram of two intersecting circles: a small one for "cancer" (1%) and a larger one (about 10%) for "positive test". In a random test of the general population, the 10% of the 99% without cancer who are given a false positive (B) outweigh the 90% of the 1% with cancer who are given a true positive (A). A and B are the joint probabilities.


The probability of a positive result being correct in a screening of the general population is thus only \frac A{A + B}.

Conversely, the probability of a positive result being incorrect in a screening of the general population is \frac B{A + B}.

(A + B) is the normaliser: dividing by it makes the positive-test probabilities sum to 1, so they can be expressed as percentages.

\frac A{A + B} + \frac B{A + B} = 1

Posterior probability

P(C | pos) = \frac {P(pos | C) \cdot P(C)}{P(pos | C) \cdot P(C) + P(pos | \neg C) \cdot P(\neg C)} = \frac {0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.1 \cdot 0.99} \approx 8.3%

i.e. in a random test of the general population, a positive result for this cancer is correct only about 8.3% of the time. This is a posterior. Similarly for a false positive …

P(\neg C | pos) = \frac {P(pos | \neg C) \cdot P(\neg C)}{P(pos | C) \cdot P(C) + P(pos | \neg C) \cdot P(\neg C)} = \frac {0.1 \cdot 0.99}{0.9 \cdot 0.01 + 0.1 \cdot 0.99} \approx 91.7%

i.e. in a random test of the general population, a positive result has a 91.7% chance of being wrong. As above, this is a posterior probability.

This may seem counterintuitive.

The main reason that the test gives a high proportion of false positives is that the actual incidence of this cancer in the general population (the prior) is very small (99% are cancer-free). So the small percentage (10%) of that 99% who test false-positive still makes up a large number of people compared with those who have cancer (at most 1%, even if the sensitivity is perfect).
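
The posterior calculation can be wrapped in a small function. This is a sketch using the A and B notation from the Venn diagram; the function name `posterior` is made up for this example.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(C | pos) given P(C), P(pos | C) and P(neg | not C)."""
    true_pos = sensitivity * prior                 # A = P(pos | C) * P(C)
    false_pos = (1 - specificity) * (1 - prior)    # B = P(pos | not C) * P(not C)
    return true_pos / (true_pos + false_pos)       # A / (A + B)

p_correct = posterior(prior=0.01, sensitivity=0.9, specificity=0.9)
print(f"{p_correct:.1%}")        # share of positives that are true positives
print(f"{1 - p_correct:.1%}")    # share of positives that are false positives
```

With the numbers from section 9 (prior 0.1, sensitivity 0.9, specificity 0.8) the same function gives 0.09 / 0.27, i.e. about one third, which shows how strongly the posterior depends on the prior.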

… more to follow …

13. Estimators

The Maximum Likelihood Estimator (MLE) looks at a given data set, picks the probabilities that make that data most likely, and uses them as the best guess for future outcomes.

E.g. if past die throws show an equal number of ones, twos, threes etc., you can estimate that it is most likely that the die is fair and predict an equal \frac1{6} likelihood of each number being thrown in the future. P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac1{6}

Extreme example: say past coin flips show only heads 100 times. This could be a fair coin, i.e. P(H) = 0.5, with a very unlikely outcome: P = \frac 1{2^{100}}. Or, more likely, it is a weighted coin where P(H) = 1. It could also be almost any other type of loaded coin with P(H) = x, where 1\geqslant x>0, but the most likely is P(H) = 1. If you plot likelihood against x you get a curve with the MLE at its maximum point.

Even more extreme example: say there is only 1 coin flip, a head. The MLE gives P(H) = 1, i.e. a 100% weighted coin. This is silly, so in these cases, with small data sets, use …
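
A sketch of the likelihood curve for the 100-heads example, assuming independent flips. It scans a grid of candidate values for P(H) instead of plotting; the grid spacing of 0.01 is an arbitrary choice.

```python
def likelihood(x, heads, tails):
    """Probability of the observed flips if P(H) = x (independent flips)."""
    return x ** heads * (1 - x) ** tails

# Candidate values of P(H) on a grid from 0.01 to 1.00.
candidates = [i / 100 for i in range(1, 101)]

# 100 flips, all heads: the likelihood x**100 is largest at x = 1.
mle = max(candidates, key=lambda x: likelihood(x, heads=100, tails=0))
print(mle)
print(likelihood(0.5, heads=100, tails=0))   # the fair-coin case, 1 / 2**100
```

For mixed data, e.g. 7 heads and 3 tails, the same scan peaks at 0.7, the observed share of heads.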

Laplacian Estimator: Add fake data to smooth results

e.g. Dice throw data {3,4,6} \triangleright {3,4,6,1,2,3,4,5,6} (one of each possible throw added).

e.g. Coin flip {H} \triangleright {H,H,T} gives a better result. Now the estimate is P(H) = \frac{2}{3}.
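
A minimal sketch of both estimators, using exact fractions. The function names and the parameter `k` (number of fake observations added per outcome) are chosen for this example.

```python
from fractions import Fraction

def mle_estimate(data, outcomes):
    """Relative frequency of each outcome in the data."""
    return {o: Fraction(data.count(o), len(data)) for o in outcomes}

def laplace_estimate(data, outcomes, k=1):
    """Same, after adding k fake observations of every possible outcome."""
    return mle_estimate(list(data) + list(outcomes) * k, outcomes)

print(mle_estimate(["H"], "HT"))       # P(H) = 1, P(T) = 0
print(laplace_estimate(["H"], "HT"))   # P(H) = 2/3, P(T) = 1/3
print(laplace_estimate([3, 4, 6], range(1, 7)))
```

In the dice case the throws that were actually seen (3, 4, 6) get 2/9 each and the unseen ones get 1/9 each, so no outcome is ruled out.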

14. Mean, Median and Mode


\mu = \frac{1}{N} \Sigma X_i (the sum of the values, i.e. X_1 + X_2 + etc. where X is a value, divided by the number of values, N). Dividing by the number of terms is called normalising.

mean of {1,2,6} is 3 (because \frac{1+2+6}3=3).

If Y = \alpha X + \beta, then \mu_Y = \alpha \mu_X + \beta.

In particular, if Y = X + \beta, then \mu_Y = \mu_X + \beta, i.e. adding a constant \beta to all values moves the mean by \beta too.

Similarly, if Y = \alpha X, then \mu_Y = \alpha \mu_X, i.e. multiplying all values by a constant \alpha multiplies the mean by \alpha too.


median of {1,2,3,4,100} is 3 (the median is the middle value when the numbers are ordered). With an even number of terms, pick one of the two middle values, or take the mean of both. Useful in the typical house price example, as it effectively disregards a very expensive outlier.


mode of {1,1,1,2,3,3,100} is 1 (the most frequent value; when there is more than one possibility, pick one). Useful for multi-modal or bi-modal data (where the data has "bumps"): it picks the value corresponding to the top of the highest bump.
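
The three measures are available in Python's standard `statistics` module. The sketch below reuses the example sets from above; the constants `alpha` and `beta` are arbitrary values picked to check the linear rule for the mean.

```python
from statistics import mean, median, mode

print(mean([1, 2, 6]))                  # 3
print(median([1, 2, 3, 4, 100]))        # 3
print(median([1, 2, 3, 100]))           # 2.5, the mean of the two middle values
print(mode([1, 1, 1, 2, 3, 3, 100]))    # 1

# Linear rule: if Y = alpha * X + beta, the mean transforms the same way.
alpha, beta = 2, 5                      # arbitrary example constants
xs = [1, 2, 6]
ys = [alpha * x + beta for x in xs]
print(mean(ys) == alpha * mean(xs) + beta)   # True
```

Note that `statistics.median` takes the mean of the two middle values when the number of terms is even.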

15. Variance

\sigma: Standard Deviation.

\sigma^2: Variance (Measures spread of data away from the mean).

\sigma^2 = \frac1{N} \Sigma( X_i-\mu)^2

\sigma^2 = \frac{\Sigma X_i^2}{N} - \frac{(\Sigma X_i)^2}{N^2}

We only need to know N, \Sigma X_i (sum of values), and \Sigma X_i^2 (sum of squares) to evaluate the second formula above.

If Y = \alpha X + \beta, then \sigma_Y = | \alpha | \sigma_X.

i.e. adding a constant \beta to all values has no effect on the standard deviation, but multiplying all values by a constant \alpha multiplies the standard deviation by | \alpha |. [extra detail: | \alpha | is the absolute value of \alpha, i.e. whether \alpha is positive or negative, you take the positive value]
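
A quick check that the two variance formulas agree, and that the scaling rule holds. The data set {1,2,6} is reused from the mean example; `alpha` and `beta` are arbitrary constants and the function names are made up for this sketch.

```python
from math import isclose, sqrt

def variance_definition(xs):
    """Average squared distance from the mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def variance_from_sums(n, total, total_squares):
    """Same quantity from N, the sum of values and the sum of squares."""
    return total_squares / n - total ** 2 / n ** 2

xs = [1, 2, 6]
v1 = variance_definition(xs)
v2 = variance_from_sums(len(xs), sum(xs), sum(x * x for x in xs))
print(v1, v2)   # both about 4.67

# Scaling rule with arbitrary constants: sigma_Y = |alpha| * sigma_X.
alpha, beta = -3, 10
ys = [alpha * x + beta for x in xs]
print(isclose(sqrt(variance_definition(ys)), abs(alpha) * sqrt(v1)))   # True
```

The sums-based version is handy when values arrive one at a time, since only three running totals need to be kept.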

z: Standard Score

z = \frac{x-\mu}\sigma

Here x is a point in a Gaussian distribution for which you want to calculate a standard score. z is negative when x is to the left of the mean, zero when x equals the mean, and positive when x is to the right of the mean.
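
A tiny sketch of the standard score. The mean of 100 and standard deviation of 15 are hypothetical values chosen only to make the arithmetic easy.

```python
def standard_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
print(standard_score(130, mu=100, sigma=15))   # 2.0
print(standard_score(100, mu=100, sigma=15))   # 0.0
print(standard_score(85, mu=100, sigma=15))    # -1.0
```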