Bias in, bias out: how to avoid the big data trap

Author: Vincent Govaers


As digitalization conquers our society at an unprecedented rate, data has become perhaps the most important asset for most organizations. It allows us to analyze consumer behavior and make better-informed individual recommendations, to analyze sales figures and promote top-performing employees, and even to build complex algorithms that recognize diseases that would otherwise be invisible to the human eye. With this seemingly infinite power at our disposal, rational decision-making seems to claim victory. This is why almost every business manager asks the same question of analysts presenting their insights: where is the data?


But why are we obsessed with making decisions based on data? Data in large volumes and at high velocity can provide insights that we could never generate on our own. We perceive these analyses as objective and sound because they are based on data and designed using solid mathematical principles. However, caution is advised: opinions can easily sneak into statistical models, and data is not always right.


An illustration will help make this clear. Let’s say we want to build an algorithm that helps us find which doctors have the best surgical skills. In the medical industry, the number of complications after surgery is often used as a factor to calculate a doctor's skill level. At first, this might seem logical. A doctor whose patients are free of complications three months after surgery in 90% of cases will be classified as highly skilled, whereas a doctor for whom that figure is around 60% will be categorized as low skilled. This skills profile could help determine doctors' salaries, the need for additional training, or whether a doctor should be fired.
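To make the example concrete, here is a minimal sketch of what such a naive skill score might look like in Python. The record fields, the function names, the sample figures, and the 75% cut-off are all assumptions made for illustration; the article itself only mentions 90% and 60% as examples and does not prescribe a threshold.

```python
from dataclasses import dataclass

# Hypothetical cut-off between "high" and "low" skill. The article only
# gives 90% and 60% as examples, so this value is an assumption.
SKILL_THRESHOLD = 0.75


@dataclass
class SurgeonRecord:
    name: str
    surgeries: int
    complications: int  # complications recorded within three months of surgery


def complication_free_rate(record: SurgeonRecord) -> float:
    """Share of surgeries with no recorded complication after three months."""
    if record.surgeries == 0:
        raise ValueError(f"{record.name} has no recorded surgeries")
    return (record.surgeries - record.complications) / record.surgeries


def naive_skill_label(record: SurgeonRecord, threshold: float = SKILL_THRESHOLD) -> str:
    """Label a doctor using only the complication-free rate."""
    return "high" if complication_free_rate(record) >= threshold else "low"


# Made-up records that mirror the 90% and 60% doctors from the example.
records = [
    SurgeonRecord("Doctor A", surgeries=200, complications=20),
    SurgeonRecord("Doctor B", surgeries=200, complications=80),
]

labels = {r.name: naive_skill_label(r) for r in records}
for r in records:
    print(f"{r.name}: {complication_free_rate(r):.0%} complication-free, skill = {labels[r.name]}")
```

Note that the label depends on nothing but the recorded complications and the chosen cut-off: both are choices made by whoever builds the algorithm.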


However, before making irreversible decisions based on this algorithm, we should break down what an algorithm is composed of:



As the figure above illustrates, an algorithm is composed of two parts:

  1. Historical data: the data that needs to be collected and fed into the algorithm (e.g., the number of complications three months after surgery).