
Bias in, bias out: how to avoid the big data trap

Written by Vincent Govaers

As digitalization transforms our society at an unprecedented rate, data has become perhaps the most important asset for most organizations. It allows them to analyze consumer behavior and make better-informed individual recommendations, to analyze sales figures and promote top-performing employees, and even to build complex algorithms that recognize diseases that would otherwise be invisible to the human eye. With this seemingly infinite power at our disposal, rational decision-making seems to have claimed victory. That is why almost every business manager asks analysts presenting their insights the same question: where is the data?

But why are we obsessed with making decisions based on data? Data arriving in large volumes and at high velocity can provide insights that we could never generate on our own. We perceive these analyses as objective and sound because they are based on data and built on solid mathematical principles. However, caution is advised: opinions can easily sneak into statistical models, and data is not always right.

An illustration will help make this clear. Let’s say we want to build an algorithm that helps us find which doctors have the best surgical skills. In the medical industry, the number of complications after surgery is often used as a factor in calculating a doctor's skill level. At first, this might seem logical. A doctor with 90% of patients free of complications three months after surgery will be classified as highly skilled, whereas a doctor with around 60% of patients free of complications will be categorized as low skilled. This skills profile could help determine doctors' salaries, the need for additional training, or whether a doctor should be fired.
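The classification described above can be sketched in a few lines of Python. This is only an illustration: the `SurgeonRecord` type, the function names, and the 0.85 cutoff are assumptions made for the example, since the article gives only the 90% and 60% cases and no exact threshold.

```python
from dataclasses import dataclass


@dataclass
class SurgeonRecord:
    """Hypothetical record of one doctor's surgical history."""
    name: str
    surgeries: int
    complications: int  # complications within three months of surgery


def complication_free_rate(record: SurgeonRecord) -> float:
    """Share of patients without complications three months after surgery."""
    if record.surgeries == 0:
        raise ValueError("cannot compute a rate without any surgeries")
    return (record.surgeries - record.complications) / record.surgeries


def skill_label(rate: float, high_cutoff: float = 0.85) -> str:
    """Label a doctor from the rate alone; the cutoff is an assumed value."""
    return "highly skilled" if rate >= high_cutoff else "low skilled"


doctor_a = SurgeonRecord("Doctor A", surgeries=100, complications=10)
doctor_b = SurgeonRecord("Doctor B", surgeries=100, complications=40)

for doctor in (doctor_a, doctor_b):
    rate = complication_free_rate(doctor)
    print(doctor.name, f"{rate:.0%}", skill_label(rate))
```

Note that the label depends on a single number and a single cutoff, both of which someone had to choose.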

However, before taking irreversible decisions based on this algorithm, we should break down what an algorithm is composed of:

As the figure above illustrates, an algorithm is composed of two parts:

  1. Historical data: the data that needs to be collected and fed into the algorithm (e.g., the number of complications three months after surgery).

  2. Definition of success: a decision that needs to be made on which data we want to feed into the algorithm and how we define success (e.g., “a good doctor ensures the patient has no complications after surgery”).
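To make the two components concrete, here is a minimal sketch in which the historical data is a list of patient outcomes and the definition of success is a function supplied by whoever builds the algorithm. The field names (`complication`, `severe`) and both definitions are hypothetical, chosen only to show that the same data can yield different scores under different definitions.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, bool]  # hypothetical shape of one patient's outcome


def success_rate(historical_data: List[Outcome],
                 is_success: Callable[[Outcome], bool]) -> float:
    """Score the historical data under a given definition of success."""
    if not historical_data:
        raise ValueError("no historical data to score")
    successes = sum(is_success(outcome) for outcome in historical_data)
    return successes / len(historical_data)


def no_complication(outcome: Outcome) -> bool:
    """Definition 1: success means no complication at all."""
    return not outcome["complication"]


def no_severe_complication(outcome: Outcome) -> bool:
    """Definition 2 (assumed alternative): only severe complications count."""
    return not outcome["severe"]


# Component 1: historical data (made-up values for illustration)
outcomes = [
    {"complication": False, "severe": False},
    {"complication": True, "severe": False},
    {"complication": True, "severe": True},
    {"complication": False, "severe": False},
]

# Component 2: the definition of success, passed in as a function
print(success_rate(outcomes, no_complication))
print(success_rate(outcomes, no_severe_complication))
```

The data never changes between the two calls; only the chosen definition does, and the score moves with it.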

Things can go wrong with each of these components. First, the historical data can lack quality: data can be inaccurate, values can be missing, datasets can be too small, etc. Second, the definition of success is always subjective. In all cases, there is a person who defines success. However, another person could disagree and define success dif