
Bias in, bias out: how to avoid the big data trap

Written by Vincent Govaers

As digitalization transforms our society at an unprecedented rate, data has become perhaps the most important asset for most organizations. It allows companies to analyze consumer behavior and make better-informed individual recommendations, to analyze sales figures and promote top-performing employees, and even to build complex algorithms that recognize diseases that would otherwise be invisible to the human eye. With this seemingly infinite power at our disposal, rational decision-making seems to claim victory. That is why almost every business manager asks the analysts presenting their insights the same question: where is the data?

But why are we obsessed with making decisions based on data? Data in large volumes and at high velocity can provide insights that we could never generate on our own. We perceive these analyses as objective and fundamental because they are based on data and designed using solid mathematical principles. However, caution is advised: opinions can easily sneak into statistical models, and data is not always right.

An illustration will help make this clear. Let’s say we want to build an algorithm that helps us find which doctors have the best surgical skills. In the medical industry, the number of complications after surgery is often used as a factor to calculate the skill level of a doctor. At first, this might seem logical. A doctor whose patients are free of complications three months after surgery in 90% of cases will be classified as highly skilled, whereas a doctor for whom that figure is around 60% will be categorized as low skilled. This skills profile could help determine the salaries of doctors, the need for additional training, or whether a doctor should be fired.

However, before taking irreversible decisions based on this algorithm, we should break down what an algorithm is composed of:

As the figure above illustrates, an algorithm is composed of two parts:

  1. Historical data: the data that needs to be collected and ingested into an algorithm (e.g., the number of complications three months after surgery).

  2. Definition of success: a decision that needs to be made on which data we want to feed into the algorithm and how we define success (e.g., “a good doctor ensures the patient has no complications after surgery”).

A couple of things can go wrong with each of these components. First, the historical data can lack quality: data can be inaccurate, values can be missing, datasets can be too small, etc. Second, the definition of success is always subjective. In all cases, there is a person who defines success, and another person could disagree and define it differently. Although an algorithm can look perfectly objective at first sight, a subjective component is always embedded within it.
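The data-quality problems listed above can be checked before any scoring takes place. The sketch below is a minimal illustration, not a production check: the `SurgeryRecord` structure, the `quality_report` helper, the `MIN_RECORDS` threshold of 30, and all records are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurgeryRecord:
    doctor: str
    # True/False if the outcome is known, None if it was never recorded.
    complication: Optional[bool]


# Hypothetical threshold: below this many usable records, we treat a
# doctor's dataset as too small to draw conclusions from.
MIN_RECORDS = 30


def quality_report(records, min_records=MIN_RECORDS):
    """Count total, missing, and usable records per doctor."""
    report = {}
    for record in records:
        entry = report.setdefault(record.doctor, {"total": 0, "missing": 0})
        entry["total"] += 1
        if record.complication is None:
            entry["missing"] += 1
    for entry in report.values():
        entry["usable"] = entry["total"] - entry["missing"]
        entry["too_small"] = entry["usable"] < min_records
    return report


# Invented example data: Dr. A has some missing outcomes,
# Dr. B has very few surgeries on record.
records = (
    [SurgeryRecord("Dr. A", False)] * 40
    + [SurgeryRecord("Dr. A", None)] * 5
    + [SurgeryRecord("Dr. B", True)] * 3
    + [SurgeryRecord("Dr. B", False)] * 7
)

report = quality_report(records)
for doctor, entry in report.items():
    print(doctor, entry)
```

A report like this does not fix the data, but it makes visible which doctors would be judged on missing or very thin evidence.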

The medical example shows both problems. Statistics have shown that the patient's health before surgery is the most influential variable in predicting complications. A doctor who only operates on young, healthy patients will be labeled as a highly skilled practitioner. On the other hand, a doctor who performs more complex surgeries on a higher-risk demographic is more likely to see complications afterward, regardless of skill. Thus, the model could be improved by taking this variable into consideration. In this example the flaw is quite clear. In practice it is often less obvious, which is why it is important to pay attention to how success is defined.
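The effect of the patient's health can be made concrete with a small calculation. The sketch below compares two doctors on their overall complication-free rate and then again within each risk group. Everything in it is an assumption for illustration: the doctors, the two risk groups, the patient counts, and the helper names `raw_rate` and `rate_by_risk` are invented and do not come from real medical data.

```python
# (doctor, risk group) -> (patients, complications).
# All figures are invented for illustration.
cases = {
    ("Dr. A", "low"): (90, 5),
    ("Dr. A", "high"): (10, 5),
    ("Dr. B", "low"): (10, 0),
    ("Dr. B", "high"): (90, 36),
}


def complication_free_rate(patients, complications):
    """Share of patients without complications."""
    return (patients - complications) / patients


def raw_rate(doctor):
    """Overall complication-free rate, ignoring patient risk."""
    rows = [counts for (d, _), counts in cases.items() if d == doctor]
    patients = sum(p for p, _ in rows)
    complications = sum(c for _, c in rows)
    return complication_free_rate(patients, complications)


def rate_by_risk(doctor):
    """Complication-free rate within each risk group."""
    return {
        risk: complication_free_rate(p, c)
        for (d, risk), (p, c) in cases.items()
        if d == doctor
    }


for doctor in ("Dr. A", "Dr. B"):
    print(doctor, f"overall: {raw_rate(doctor):.0%}")
    for risk, rate in rate_by_risk(doctor).items():
        print(f"  {risk}-risk patients: {rate:.0%}")
```

With these invented numbers, Dr. A scores 90% overall and Dr. B only 64%, yet Dr. B has the higher complication-free rate in both the low-risk and the high-risk group. The overall ranking reflects which patients each doctor operates on, not only how well they operate.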

To use big data and modern processing tools effectively, it is crucial to assess the quality of the data fed into an algorithm. Garbage in will always result in garbage out. Next, algorithms should be tested continuously until the desired outcome and accuracy are achieved. Poor algorithms can easily remain undetected for many years if no one takes the step to investigate and speak up. Finally, decision-making should always remain in human hands: algorithms are not perfect, data is not the only argument, and full transparency needs to be provided to avoid algorithms becoming a ‘black box’. We should at least understand the variables that are used to feed an algorithm, right?

The total volume of data created over the next three years will be more than the data created over the past 30 years. We see businesses include more and more data in their decision-making, but the data itself is rarely questioned. If we do not ask ourselves the right questions about data quality, the variables used, or the subjectivity of the definition of success, it will be just a matter of time before the wrong decisions are taken. To avoid the Big Data trap, it is time to halt our blind trust in algorithms, start understanding the unknowns, and use the data to its full potential.

Are you looking for support to take full control over your data and create real business value? Do not hesitate to contact us! We are happy to grab a coffee and discuss your data challenges.

