These are draft notes extracted from subtitles.
A while back I told you about Simpson's paradox, and it was surprising how easy it was to draw a false conclusion from data. Today I will give you a deeper insight into a common mistake made in interpreting statistical data: confusing correlation with causation. I'll show you examples where data is correlated and why it's tempting to confuse correlation with causation. Both of those are words that start with a C, and very frequently I read newspaper articles that deeply confuse the two. So let's dive in.
Suppose you are sick, and you wake up with a strong pain in the middle of the night. You're so sick that you fear you might die, but you're not too sick to apply the lessons of my Statistics 101 class to make a rational decision about whether to go to the hospital. In doing so, you consult the data. You find that in your town, over the last year, 40 people were hospitalized, of which 4 passed away. The vast majority of the population of your town, 8,000 people, never went to the hospital, and of those, 20 passed away at home. So compute for me the percentage of people who died in the hospital and the percentage of people who died at home.
And the answer is, quite obviously, that 10% of the people in the hospital died, 4 out of 40, whereas only 0.25% of the people at home passed away, 20 out of the 8,000 who stayed home.
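The arithmetic behind this answer can be sketched in a few lines of Python. This is only an illustration: the at-home population of 8,000 is an assumption taken from the later breakdown in this lecture (40 sick plus 7,960 healthy people at home), and the function and variable names are made up for the example.

```python
# Counts from the example. The at-home population of 8,000 is an assumption
# derived from the later breakdown (40 sick + 7,960 healthy people at home).
hospital_total, hospital_deaths = 40, 4
home_total, home_deaths = 8_000, 20


def death_rate(deaths, total):
    """Return the fraction of a group that passed away."""
    return deaths / total


hospital_rate = death_rate(hospital_deaths, hospital_total)
home_rate = death_rate(home_deaths, home_total)

# How many times larger the hospital rate is than the at-home rate.
ratio = hospital_rate / home_rate

print(f"hospital: {hospital_rate:.2%}")
print(f"home:     {home_rate:.2%}")
print(f"ratio:    {ratio:.0f}x")
```

Under these assumptions the two rates come out to 10% and 0.25%, which is where the factor of 40 comes from.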
Now, I offer this as a fictitious example, and these are relatively large numbers. But what's important to notice is that the chance of dying in a hospital is 40 times as large as the chance of dying at home. That means whether you die or not is correlated with whether or not you are in a hospital. So the chances of dying in a hospital are indeed 40 times larger than at home. So let me ask the critical question: should you now stay at home? Given that you are a really smart statistics student, do you resist the temptation to go to the hospital because it might indeed increase your chances of passing away?
And I realize both answers count, but clearly the correct answer is no. You should go to the hospital. Hospitals don't cause sick people to die. I know there have been lots of studies showing that if a perfectly healthy person goes to a hospital, they might actually catch a disease there, but hospitals exist for a reason: they want to cure people. Why is this interesting? Because based on the correlation data, it seems that being in a hospital makes you 40 times as likely to die as being at home, but that doesn't mean that by staying at home you reduce your chances of dying. So "the chances of dying in a hospital are 40 times larger than at home" is a statement of correlation.
"Being in a hospital, that fact alone, increases your probability of dying by a factor of 40" is a causal statement. It says the hospital causes you to die, not just that it coincides with the fact that you die. Very frequently, people in the public get this wrong. They observe that there is a correlation, but they suggest the correlation is causal in attempting to make you read the statistic as a call to action. Now, to understand why this could be wrong, let's dive a little bit deeper into the same example.
Let's say of the 40 people in the hospital, 36 were actually sick and 4 of them passed away, and 4 were healthy and they all survived. Let's further assume that, of the people at home, 40 were indeed sick and 20 of them passed away, whereas the remaining 7,960 were healthy and also incurred a total of 20 deaths, perhaps because of accidents. These statistics are consistent with the statistics I gave you before. We just added another variable: whether the person is sick or healthy. Please now fill out once again, in percent: what is the percentage of people that passed away in each of those 4 groups?
When you divide 4 by 36, you get 0.111, or in percent, 11.1%. That's the mortality rate of sick people in the hospital. It's 0% for the healthy people. The mortality rate for sick people at home is 50%, since half of them pass away, and approximately 0.25% for healthy people. Now, if we look at this, we realize that you are likely sick. If you fall into the sick category, your chances of dying at home are 50%, and just about 11% in the hospital. So you should really go to the hospital very quickly.
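A small Python sketch can reproduce the four group rates. It assumes that 20 of the 40 sick people at home passed away, which matches the 50% quoted above; the dictionary layout and function names are illustrative only. Note that this breakdown sums to 40 deaths at home rather than the 20 quoted earlier, so the aggregate at-home rate computed here is 0.5%; the direction of the comparison is unchanged.

```python
# (deaths, group size) for each combination of place and health status.
# Assumption: 20 of the 40 sick people at home passed away (the quoted 50%).
groups = {
    ("hospital", "sick"): (4, 36),
    ("hospital", "healthy"): (0, 4),
    ("home", "sick"): (20, 40),
    ("home", "healthy"): (20, 7_960),
}


def rate(deaths, total):
    """Return the fraction of a group that passed away."""
    return deaths / total


# Mortality rate within each of the four groups.
stratified = {key: rate(*counts) for key, counts in groups.items()}
for (place, status), r in stratified.items():
    print(f"{place:8s} {status:8s} {r:.2%}")


def aggregate(place):
    """Mortality rate for a place when sick and healthy are lumped together."""
    deaths = sum(d for (p, _), (d, _n) in groups.items() if p == place)
    total = sum(n for (p, _), (_d, n) in groups.items() if p == place)
    return deaths / total


print(f"hospital overall {aggregate('hospital'):.2%}")
print(f"home overall     {aggregate('home'):.2%}")
```

Lumped together, the hospital looks more dangerous than staying home. Split by sickness, the hospital has the lower mortality rate in both groups.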
Let's observe in more detail why the hospital example gives us such a wrong conclusion. We study two variables: being in the hospital, and dying or passing away. We rightfully observe that these two things are correlated. If we were to do a scatter plot with two categories, whether or not a person is in the hospital and whether or not a person passed away, you would find an increased occurrence of data over here and of data over here relative to the other two data points over here. That means the data correlates. What does correlation mean? In any plot, data is correlated if knowledge about one variable tells us something about the other. This is a correlated data plot. Here's another data plot. Correlated or not? Yes or no?
The answer is no. No matter where I am in A, B seems to be the same.
And now the data sits in a square in which it is uniformly arranged. Correlated, yes or no?
The answer is negative. No matter where I am in A, the range for B is the same, as is the mean estimate.
Another data set: there's a boomerang shape over here. Correlated? Yes or no?
The answer is yes. Clearly, for different values of A, I get different values of B. It's a nonlinear correlation, yet still a correlation.
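The idea that data is correlated when knowing A tells us something about B can be sketched in Python. The three data sets below are hypothetical stand-ins for the plots in the lecture (a line, a uniformly filled square, and a symmetric V-like curve for the boomerang), and the helper names are invented for this illustration. The sketch compares the mean of B for each value of A with Pearson's coefficient, which only measures linear association.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mean_b_given_a(points):
    """Average value of B for each distinct value of A."""
    sums = {}
    for a, b in points:
        total, count = sums.get(a, (0, 0))
        sums[a] = (total + b, count + 1)
    return {a: total / count for a, (total, count) in sorted(sums.items())}


# Hypothetical data sets standing in for the plots (not the lecture's data).
line = [(a, 2 * a + 1) for a in range(-5, 6)]
square = [(a, b) for a in range(-5, 6) for b in range(-5, 6)]
boomerang = [(a, a * a) for a in range(-5, 6)]

for name, points in [("line", line), ("square", square), ("boomerang", boomerang)]:
    xs, ys = zip(*points)
    means = mean_b_given_a(points)
    # If the mean of B changes with A, knowing A tells us something about B.
    informative = len(set(means.values())) > 1
    print(f"{name:10s} pearson={pearson(xs, ys):+.2f} A informs B: {informative}")
```

With this particular symmetric curve, Pearson's coefficient is zero even though B clearly depends on A, which is one way to see that a nonlinear relationship needs more than a linear measure to detect.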
So clearly, in our example, whether or not you're in a hospital is correlated with whether or not you die, but the truth is, the example omitted an important variable: the sickness, the disease itself. And in fact, the sickness did cause you to die, and it also affected your decision of whether to go to a hospital or not. So if you draw arcs of causation, you find sickness causes death, and sickness causes you to go to the hospital. And if anything at all, once you knew you were sick, being in the hospital was negatively correlated with dying; that is, being in a hospital made it less likely for you to pass away, given that you were sick. In statistics, we call this a confounding variable. It's very tempting to just omit this from your data, and if you do, you might find correlations, in this case a positive correlation between the hospital and death, that have nothing to do with the way things are being caused. As a result, those correlations don't relate at all to what you should do. So let's study another example.
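One way to make the sign flip concrete is the phi coefficient, a correlation measure for two yes/no variables. The lecture does not use this measure, so treat the following as an illustrative sketch rather than the lecture's method. It assumes the breakdown of 36 sick hospital patients with 4 deaths, 4 healthy hospital patients with no deaths, 40 sick people at home with 20 deaths, and 7,960 healthy people at home with 20 deaths; the overall table simply sums those groups.

```python
from math import sqrt


def phi(a, b, c, d):
    """Phi coefficient of a 2x2 table laid out as:

                 died  survived
    hospital      a       b
    home          c       d
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Everyone, sick and healthy lumped together (assumed sums of the breakdown):
# hospital 4 died / 36 survived, home 40 died / 7,960 survived.
overall = phi(4, 36, 40, 7_960)

# Sick people only: hospital 4 died / 32 survived, home 20 died / 20 survived.
sick_only = phi(4, 32, 20, 20)

print(f"overall:   {overall:+.3f}")
print(f"sick only: {sick_only:+.3f}")
```

The overall coefficient is positive while the coefficient among sick people is negative, which is the pattern a confounding variable can produce.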
Suppose you observed a number of different fires, and you graph the number of firefighters versus the size of the fire. For the sake of the argument, let's assume we studied four fires with 10, 40, 200, and 70 firefighters involved, and the sizes of the fires were given as follows: 100, 400, 2000, and 700, in terms of the surface area that the fire occupied. Putting this into a diagram of the number of firefighters against the size of the fire, you get pretty much the following. In fact, as you've already learned, this looks very linear. So let me ask a question: is the number of firefighters correlated with the size of the fire? Yes or no? Check one of the two boxes.
And obviously it is because there's a strong linear correlation.
Now the real question I'm getting at is, "Do firefighters cause fire?" or, more extremely, "If we get rid of our firefighters, will we get rid of all the fires?" Obviously, this seems to be in the data.
And the answer is no. This is a case of what we call reverse causation. You can argue that the size of the fire causes the number of firefighters being deployed, and that's because the bigger the fire, the more firefighters the fire department will send. Now our graph, which shows the correlation between these two variables, is oblivious to the direction of this arc. You could conclude that the size of the fire causes the number of firefighters. You could conclude that the number of firefighters causes the size. In both cases you would use exactly the same data. But when I put it this way, it's pretty obvious that the right answer should be that the size of the fire causes the number of firefighters to go up, and that's not from the data itself. It's because we know something about fire and firefighters. It's impossible to deduce from this data that there is a causal relationship. It could be just coincidental, or the causal relationship could go either way.
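The point that the data is oblivious to the direction of the arc can be checked numerically. The sketch below computes Pearson's coefficient for the four fires in both orders; the helper function is written out by hand for the illustration and is not something the lecture provides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# The four fires from the example.
firefighters = [10, 40, 200, 70]
fire_size = [100, 400, 2000, 700]

# Correlation is symmetric: swapping the two variables gives the same number.
r_xy = pearson(firefighters, fire_size)
r_yx = pearson(fire_size, firefighters)

print(f"firefighters vs size: {r_xy:.3f}")
print(f"size vs firefighters: {r_yx:.3f}")
```

Because the coefficient is the same whichever variable is treated as the input, the number alone cannot say whether fires summon firefighters or firefighters summon fires.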
So here's my assignment for you. Go online, check old news articles, and find me one that takes data showing a correlation and from that data suggests causation or, put differently, tells you what to do. I'd argue that the news is full of these abuses of statistics, and we will talk later about how to set up a study to avoid this trap. For your assignment, if you find an article that has this property, extract that text and post it to the discussion forum. I will be monitoring and commenting on those, and I want all of us to enjoy the hilarious misinterpretations of statistics you can find in which people confuse correlation and causation. So go ahead and find me interesting articles. Thank you!