**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**

Contents

- 1 12. Correlation vs Causation
- 1.1 01 Correlation And Causation
- 1.2 02 Mortality
- 1.3 03 Mortality Solution
- 1.4 04 Deciding
- 1.5 05 Deciding Solution
- 1.6 06 Assuming Causation
- 1.7 07 Considering Health
- 1.8 08 Considering Health Solution
- 1.9 09 Correlation 1
- 1.10 10 Correlation 1 Solution
- 1.11 11 Correlation 2
- 1.12 12 Correlation 2 Solution
- 1.13 13 Correlation 3
- 1.14 14 Correlation 3 Solution
- 1.15 15 Causation Structure
- 1.16 16 Fire Correlation
- 1.17 17 Fire Correlation Solution
- 1.18 18 Fire Causation
- 1.19 19 Fire Causation Solution
- 1.20 20 Assignment

A while back I told you about Simpson's paradox, and it was surprising how easy it was to draw a false conclusion from data. Today I will give you a deeper insight into a common mistake made in interpreting statistical data: confusing correlation with causation. I'll show you examples where data is correlated and why it's tempting to confuse correlation with causation. Both of those are words that start with a C, and very frequently I read newspaper articles that deeply confuse correlation and causation. So let's dive in.

Suppose you are sick, and you wake up with a strong pain in the middle of the night. You're so sick that you fear you might die, but you're not sick enough not to apply the lessons of my Statistics 101 class to make a rational decision about whether to go to the hospital. In doing so, you consult the data. You find that in your town, over the last year, 40 people were hospitalized, of which 4 passed away, whereas the vast part of the population of your town, about 8,000 people, never went to the hospital, and of those, 20 passed away at home. So compute for me the percentage of the people who died in the hospital and the percentage of the people who died at home.

And the answer is, quite obviously, that 10% of the people in the hospital died, 4 over 40, whereas only 0.25% of the people at home passed away, 20 over 8,000.
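The two percentages can be checked with a few lines of Python. This is only a sketch: the at-home population of 8,000 is an assumption inferred from the later breakdown (40 sick plus 7,960 healthy people at home), and the function name `mortality_percent` is made up for illustration.

```python
# Counts from the example. The at-home total of 8,000 is an assumption
# inferred from the later breakdown (40 sick + 7,960 healthy).
hospital_total, hospital_deaths = 40, 4
home_total, home_deaths = 8000, 20


def mortality_percent(deaths, total):
    """Return the share of people who died, in percent."""
    return 100.0 * deaths / total


hospital_rate = mortality_percent(hospital_deaths, hospital_total)
home_rate = mortality_percent(home_deaths, home_total)

print(f"Hospital: {hospital_rate:.2f}%")  # 10.00%
print(f"Home:     {home_rate:.2f}%")      # 0.25%
```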

Now, I offer this as a fictitious example, and these are relatively large numbers. But what's important to notice is that the chances of dying in a hospital are 40 times as large as the chances of dying at home. That means whether you die or not is correlated with whether or not you are in a hospital. So the chances of dying in a hospital are indeed 40 times larger than at home. So let me ask the critical question. Should you now stay at home? Given that you are a really smart statistics student, can you resist the temptation to go to the hospital, because it might indeed increase your chances of passing away?

And realize both answers count, but clearly the correct answer is no: you should go to the hospital. Hospitals don't cause sick people to die. I know that there have been lots of studies showing that if a perfectly healthy person goes to a hospital, they might actually catch a disease there, but hospitals exist for a reason: they want to cure people. Why is this interesting? Because based on the correlation data, it seems that being in a hospital makes you 40 times as likely to die as being at home, but that doesn't mean that by staying at home you reduce your chances of dying. So the observation that people in the hospital are 40 times as likely to die is a statement of correlation.

"Being in a hospital, that fact alone, increases your probability of dying by a factor of 40" is a causal statement. It says the hospital causes you to die, not just that it coincides with the fact that you die. Very frequently, people in the public get this wrong. They observe that there is a correlation, but they suggest the correlation is causal in an attempt to make you read the statistic as a call to action. Now, to understand why this could be wrong, let's dive a little bit deeper into the same example.

Let's say that of the 40 people in the hospital, 36 were actually sick, and 4 of them passed away, and some were healthy, 4 of them, and they all survived. Let's further assume that, of the people at home, 40 were indeed sick, and 20 of them passed away, whereas the remaining 7,960 were healthy and also incurred a total of 20 deaths, perhaps because of accidents. These statistics are broadly consistent with the statistics I gave you before. We just added another variable: whether the person is sick or healthy. Please now fill out once again, in percent, the share of people that passed away in each of those 4 groups.

When you divide 4 by 36, you get 0.111, or in percent, 11.1%. That's the mortality rate of sick people in the hospital. It's 0 for the healthy people in the hospital. The mortality rate at home is 50% for sick people, since half of them pass away, and approximately 0.25% for healthy people. Now, if we look at this, we realize that you are likely sick. If you fall into the sick category, your chances of dying at home are 50%, and they're just about 11% in the hospital. So you should really go to the hospital very quickly.
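The four group rates, and the pooled rates per place, can be recomputed as a quick check. This is a sketch using the breakdown as read from the example (36 sick and 4 healthy in the hospital, 40 sick and 7,960 healthy at home); the names `groups`, `rate`, and `pooled` are illustrative. Taken literally, the at-home groups add up to 40 deaths, so the pooled at-home rate comes out at 0.5% rather than the 0.25% quoted earlier. The direction of both comparisons is unaffected.

```python
# (health, place) -> (people, deaths), as given in the breakdown.
groups = {
    ("sick", "hospital"): (36, 4),
    ("healthy", "hospital"): (4, 0),
    ("sick", "home"): (40, 20),
    ("healthy", "home"): (7960, 20),
}


def rate(people, deaths):
    """Mortality rate in percent."""
    return 100.0 * deaths / people


# Mortality within each of the four groups.
rates = {key: rate(*value) for key, value in groups.items()}
for (health, place), r in rates.items():
    print(f"{health:8s} {place:9s} {r:6.2f}%")


def pooled(place):
    """Mortality rate for a place, ignoring whether people were sick."""
    people = sum(p for (_, where), (p, _) in groups.items() if where == place)
    deaths = sum(d for (_, where), (_, d) in groups.items() if where == place)
    return rate(people, deaths)


# Pooled, the hospital looks far more dangerous than home ...
print(f"pooled hospital {pooled('hospital'):.2f}%  pooled home {pooled('home'):.2f}%")
# ... yet among sick people the hospital has the lower mortality.
print(rates[("sick", "hospital")] < rates[("sick", "home")])
```

Ignoring the sick/healthy variable makes the hospital look worse, while conditioning on it reverses the comparison for the group you actually belong to.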

Let's observe in more detail why the hospital example gives us such a wrong conclusion. We study two variables: being in the hospital, and dying or passing away. We rightfully observe that these two things are correlated. If we were to do a scatter plot where we have two categories, whether or not we're in the hospital and whether or not a person passed away, you find there's an increased occurrence of data over here and of data over here, relative to the other two data points over here. That means the data correlates. What does correlation mean? In any plot, data is correlated if knowledge about one variable tells us something about the other. This is a correlated data plot. Here's another data plot. Correlated or not? Yes or no?

The answer is no. No matter where I am in A, B seems to be the same.

And now the data fits a square in which the data is uniformly arranged. Correlated, yes or no?

The answer is negative. No matter where I am in A, the range for B is the same, as is the mean estimate.

Another data set: there's a boomerang over here. Correlated? Yes or no?

The answer is yes. Clearly, for different values of A, I get different values of B. Not a linear correlation, yet still a correlation.
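The three quiz plots can be mimicked with small made-up data sets. The sketch below is an illustration under assumptions: the points are hypothetical stand-ins for the plots in the video (in particular, the boomerang is modeled as a symmetric V shape), and `pearson` and `mean_b_by_a` are helper names introduced here. It computes the usual linear (Pearson) correlation coefficient and, separately, the average of B for each value of A, which is closer to the definition used above: knowing A tells you something about B.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mean_b_by_a(xs, ys):
    """Average value of B for each distinct value of A."""
    buckets = {}
    for x, y in zip(xs, ys):
        buckets.setdefault(x, []).append(y)
    return {x: sum(v) / len(v) for x, v in sorted(buckets.items())}


# Hypothetical data sets standing in for the plots.
line = [(a, 2 * a + 1) for a in range(5)]               # straight line
square = [(a, b) for a in range(5) for b in range(5)]   # uniform square
boomerang = [(a, abs(a)) for a in range(-4, 5)]         # symmetric bend

for name, points in [("line", line), ("square", square), ("boomerang", boomerang)]:
    xs, ys = zip(*points)
    print(name, f"r={pearson(xs, ys):.3f}", mean_b_by_a(xs, ys))
```

For the square, the average of B is the same for every A, so A says nothing about B. For this particular boomerang the linear coefficient is close to zero, yet the average of B clearly changes with A, which is why it still counts as correlated in the sense used here.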

So clearly, in our example, whether or not you're in a hospital correlated with whether or not you died, but the truth is, the example omitted an important variable: the sickness, the disease itself. And in fact, the sickness did cause you to die, and it also affected your decision of whether to go to a hospital or not. So if you draw arcs of causation, you find sickness causes death, and sickness causes you to go to the hospital, and if anything at all, once you knew you were sick, being in the hospital negatively correlated with you dying; that is, being in a hospital made it less likely for you to pass away, given that you were sick. In statistics, we call this a confounding variable. It's very tempting to just omit this from your data, and if you do, you might find correlations, in this case a positive correlation between the hospital and death, that have nothing to do with the way things are being caused, and as a result, those correlations don't relate at all to what you should do. So let's study another example.
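The causal structure described here (sickness leads to death, sickness leads to the hospital) can be turned into a small simulation. This is a minimal sketch, not part of the lecture: every probability below is an illustrative assumption, loosely modeled on the example's rates, and the variable names are made up.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# All probabilities are illustrative assumptions, not figures from the lecture.
P_SICK = 0.01
P_HOSPITAL = {True: 0.5, False: 0.0005}   # keyed by "is sick"
P_DEATH = {                                # keyed by (sick, in hospital)
    (True, True): 0.11,
    (True, False): 0.5,
    (False, True): 0.0,
    (False, False): 0.0025,
}

people = []
for _ in range(200_000):
    sick = random.random() < P_SICK                      # the confounder
    hospital = random.random() < P_HOSPITAL[sick]        # sickness -> hospital
    died = random.random() < P_DEATH[(sick, hospital)]   # sickness -> death
    people.append((sick, hospital, died))


def death_rate(rows):
    """Fraction of the given people who died."""
    rows = list(rows)
    return sum(died for _, _, died in rows) / len(rows)


overall_hospital = death_rate(p for p in people if p[1])
overall_home = death_rate(p for p in people if not p[1])
sick_hospital = death_rate(p for p in people if p[0] and p[1])
sick_home = death_rate(p for p in people if p[0] and not p[1])

print(f"everyone: hospital {overall_hospital:.3f} vs home {overall_home:.3f}")
print(f"sick only: hospital {sick_hospital:.3f} vs home {sick_home:.3f}")
```

Leaving out the `sick` variable yields a strong positive association between hospital and death; conditioning on it shows the hospital group doing better, which matches the argument above.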

Suppose you observed a number of different fires, and you graph the number of firefighters versus the size of the fire. For the sake of the argument, let's assume we studied four fires with 10, 40, 200, and 70 firefighters involved, and the sizes of the fires were given as follows: 100, 400, 2000, and 700, in terms of the surface area that the fire occupied. Putting this into a diagram, with the number of firefighters on one axis and the size of the fire on the other, you get pretty much the following. In fact, as you've already learned, this looks very linear. So, let me ask a question: is the number of firefighters correlated with the size of the fire? Yes or no? Check one of the two boxes.

And obviously it is because there's a strong linear correlation.
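The strength of the linear relationship can be checked by computing the Pearson correlation coefficient for the four fires. This is a small sketch; the helper name `pearson` is introduced here for illustration.

```python
from math import sqrt

firefighters = [10, 40, 200, 70]
fire_size = [100, 400, 2000, 700]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson(firefighters, fire_size)
print(f"r = {r:.3f}")  # r = 1.000
```

Every fire size is exactly ten times the number of firefighters, so the points lie on a straight line and the coefficient is 1.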

Now the real question I'm bringing up is, "Do firefighters cause fire?" or, more extremely, "If you get rid of all our firefighters, will you get rid of all the fire?" At first glance, this seems to be in the data.

And the answer is no. This is a case of what we call reverse causation. You can argue that the size of the fire causes the number of firefighters that is being deployed, and that's because the bigger the fire, the more firefighters the fire department will send. Now, our graph, which shows the correlation between these two variables, is oblivious to the direction of this arc. You could conclude the size of the fire causes the number of firefighters. You could conclude the number of firefighters causes the size. In both cases you could use exactly the same data. But when I put it this way, it's pretty obvious that the right answer should be that the size of the fire causes the number of firefighters to go up, and that's not from the data itself. It's because we know something about fire and firefighters. It's impossible to deduce from this data that there is a causal relationship. It could be just coincidental, or the causal relationship could go either way.
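That the data is oblivious to the direction of the arc can be seen by fitting a line in both directions. The sketch below uses the four fires from the example; `least_squares_slope` and `max_residual` are helper names introduced for illustration.

```python
firefighters = [10, 40, 200, 70]
fire_size = [100, 400, 2000, 700]


def least_squares_slope(xs, ys):
    """Slope of the best-fit line that predicts ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def max_residual(xs, ys):
    """Largest absolute error of the best-fit line predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = least_squares_slope(xs, ys)
    return max(abs(y - (my + slope * (x - mx))) for x, y in zip(xs, ys))


# "Firefighters explain fire size": about 10 units of area per firefighter.
size_per_firefighter = least_squares_slope(firefighters, fire_size)
# "Fire size explains firefighters": about 0.1 firefighters per unit of area.
firefighters_per_size = least_squares_slope(fire_size, firefighters)

print(size_per_firefighter, max_residual(firefighters, fire_size))
print(firefighters_per_size, max_residual(fire_size, firefighters))
```

Both fits describe the four points essentially perfectly, so nothing in the numbers favors one direction. Choosing "size causes firefighters" relies on what we know about fire departments, not on the data.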

So here's my assignment for you. Go online, check old news articles, and find me one that takes data that shows a correlation and, from the data, suggests causation, or, put differently, tells you what to do. I'd argue the news is full of these abuses of statistics, and we will talk later about how to set up a study to avoid this trap. But for your assignment, if you find an article that has this property, extract that text and post it to the discussion forum. I will be monitoring and commenting on those, and I want all of us to enjoy the hilarious, funny misinterpretations of statistics you can find in which people confuse correlation and causation. So go ahead and find me interesting articles. Thank you!