Data Science: Correlation and Causation
Why does correlation not always imply causation?
Data science is a popular term for a discipline that combines several fields, including statistics, scientific methods, and artificial intelligence, with the goal of extracting explanations from data. Analysis and visualization, however, do not always tell the true story. This article discusses one of the most common fallacies in data analysis: inferring a causal relationship from a spurious one, that is, drawing the false conclusion that correlation implies causation.
Rainfall Causes Umbrella Sales
Correlation: refers to the statistical relationship between two variables. In other words, it describes how closely the two tend to move together, not whether one actually affects the other.
Causation: indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.
In this particular instance, rainfall causes umbrella sales. The chart shows a positive correlation between rainfall and umbrella sales, and here the causal reading holds: when it rains more, people buy more umbrellas.
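To make "positive correlation" concrete, here is a minimal sketch that computes the Pearson correlation coefficient, one common measure of linear correlation, in plain Python. The rainfall and umbrella figures are made up for illustration and are not the data behind the chart; the names `pearson`, `rainfall_mm`, and `umbrellas_sold` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures, for illustration only
rainfall_mm = [10, 25, 40, 55, 70, 90]
umbrellas_sold = [12, 30, 41, 60, 68, 95]

r = pearson(rainfall_mm, umbrellas_sold)
print(f"r = {r:.3f}")  # a value near +1 means a strong positive correlation
```

A coefficient near +1 or -1 only says that the two series move together. Whether one causes the other has to come from outside the number, as in the rainfall case, where the mechanism is obvious.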
Just because two things happen simultaneously, however, does not mean that one caused the other. For example:
Let's Talk Lemons
Lemons Save Lives
According to the chart to the left, traffic fatalities fell as lemon imports from Mexico increased. If the relationship were causal, one would conclude that the more lemons the US imports from Mexico, the fewer traffic fatalities it could expect on its roadways, making the lemon a true hero.
Spurious Hero Lemons
The fact is, however, that the relationship between lemon imports and traffic fatalities is what we call spurious. The two series are clearly correlated mathematically, but we also know that factors other than lemon imports reduced the number of traffic fatalities.
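One common way for two unrelated series to end up correlated is that both simply drift over time. The sketch below uses synthetic, independently generated data (not the actual lemon import or traffic fatality figures) to show the effect: the raw levels are strongly correlated, but the period-over-period changes are not. All names and numbers here are assumptions chosen for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def changes(series):
    """Period-over-period differences, which remove a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

random.seed(0)  # fixed seed so the sketch is reproducible
periods = range(200)

# Two synthetic series generated independently of each other:
# one drifts upward over time, the other drifts downward.
lemon_imports = [100 + 1.0 * t + random.gauss(0, 5) for t in periods]
traffic_deaths = [500 - 1.0 * t + random.gauss(0, 5) for t in periods]

r_levels = pearson(lemon_imports, traffic_deaths)
r_changes = pearson(changes(lemon_imports), changes(traffic_deaths))

print(f"correlation of levels:  {r_levels:.2f}")   # strongly negative
print(f"correlation of changes: {r_changes:.2f}")  # much closer to zero
```

Differencing is only one quick sanity check. A weak correlation between changes suggests the strong correlation between levels came from the shared trend, but it does not by itself identify what actually drove either series.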
Whenever data is concerned, it is important to validate relationships thoroughly before accepting conclusions. This is one of the places where artificial intelligence and machine learning require additional work: pattern recognition, statistical significance, and correlated data alone can lead even computer models to infer false causal relationships. Once a potential causal relationship has been identified, it is important to apply the scientific method to validate it and avoid traps like "buying ice cream increases the risk of a shark attack."
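As a rough illustration of one such validation step, the sketch below assumes, as is commonly suggested for this example, that hot weather drives both ice cream sales and shark attacks. It generates synthetic data under that assumption and compares the raw correlation with the partial correlation after controlling for temperature. The data, coefficients, and names (`partial_corr`, `temperature`, and so on) are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed data-generating process: temperature influences both variables,
# and neither variable influences the other.
temperature = [random.uniform(10, 35) for _ in range(500)]
ice_cream_sales = [2.0 * t + random.gauss(0, 5) for t in temperature]
shark_attack_index = [0.5 * t + random.gauss(0, 2) for t in temperature]

r_raw = pearson(ice_cream_sales, shark_attack_index)
r_partial = partial_corr(ice_cream_sales, shark_attack_index, temperature)

print(f"raw correlation:     {r_raw:.2f}")      # clearly positive
print(f"partial correlation: {r_partial:.2f}")  # near zero given temperature
```

Controlling for a suspected confounder only helps when that confounder has been measured and the relationships are roughly linear. It complements controlled experiments and domain knowledge; it does not replace them.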