Computers can be very good at pattern detection; whole fields of study, such as machine learning and artificial intelligence, are built on doing just that.  But something that can be surprisingly difficult is identifying deviations from those patterns.  These deviations are often called outliers, because the data points fall outside the general trend.  Sometimes outliers arise from statistical noise, or from a problem in data collection or recording.  But sometimes they are clues that something very interesting or important is going on.

This is the story of a pre-computer-era outlier, and how a very persistent doctor used it to revolutionize modern medicine.

A Court for King Cholera

One of the most famous outliers in all of data science comes from a mid-19th century cholera study.  In those days, the germ theory of disease was still unproven, and the most popular idea was that disease spread through bad air.  When a terrible cholera outbreak hit London in 1854, though, a doctor named John Snow (not the Game of Thrones character, although we love making the joke) was convinced that it was spreading via microbes in the water supply.  But how could he prove that?

Like most good researchers, he turned to visualization to help him unlock the clues.  He made a map of the affected London neighborhood, marking the cholera deaths as they added up.  There was a clear concentration, but the pattern was intriguingly imperfect.  Some households were hit terribly, while some nearby institutions (a brewery, a workhouse) were nearly untouched.

[Figure: John Snow's map of cholera deaths around the Broad Street pump]

It was the appearance of a mysterious outlier that cracked the case wide open.  A woman living far away from the center of the epidemic also died of cholera.  What connected her case to all the others?  John Snow learned that she used to live in the epidemic-struck neighborhood, and loved the taste of the water from the Broad Street pump so much that she had it delivered to her new house after she moved away.  With this convincing data point in hand, John Snow persuaded the authorities to remove the handle from the Broad Street pump, cutting off access to the contaminated water and helping to bring the epidemic to an end.

It took a while longer for the germ theory of disease to gain wide acceptance, but John Snow’s map was a turning point.  Today, of course, much outlier detection is done with powerful computer algorithms and large datasets, instead of pencil and paper.  But the basic idea remains: stay open to the interesting things the data might be trying to tell you.  As Isaac Asimov once said, “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka’ but ‘That’s funny’ …”

For more stories about interesting applications of data science and machine learning, check out our new podcast, Linear Digressions.