AI’s Unseen Culprit: Unravelling the Bias Within

Bias blog series: part 1

Introduction

As artificial intelligence matures, machines tasked with making complex decisions are becoming more and more prevalent. A growing body of literature documents the use of AI in domains such as business, high-stakes decision-making and, over the past few years, the medical sector. With this growing prevalence, however, people have noticed concerning tendencies in these systems: although they are designed purely to follow patterns in the data, they have shown signs of prejudice, in the sense that sexist and otherwise discriminatory behavior can be observed. The recent European AI Act also covers such prejudice rather extensively and sets a foundation for tackling the problems associated with it.

In the technical literature, the term “bias” has long been used to describe this kind of skewed behavior towards certain demographics. The meaning of the word varies, however, which causes confusion and complicates the task of addressing it.

This article is the first in a series of blog posts covering the topic of bias. In this series, we’ll aim to give you a clear, digestible understanding of bias in AI. We’ll introduce ways to measure and minimize bias and explore the role of synthetic data on the path towards fairer systems. We’ll also give you a peek into how Syntho, a leading player in synthetic data generation, can contribute to this effort. So, whether you’re a practitioner looking for actionable insights or just curious about this topic, you’re in the right place.

Bias in Action: A Real-World Example

You may be wondering, “This bias in AI all sounds important, but what does it mean for me, for ordinary people?” The truth is that the impact is far-reaching, often invisible but potent. Bias in AI is not a mere academic concept; it’s a real-world problem with serious consequences.

Take the Dutch childcare benefits scandal as an example. The automated system, supposedly a tool created to generate fair and efficient results with minimal human intervention, was biased. It wrongly flagged thousands of parents for fraud based on flawed data and assumptions. The result? Families thrown into turmoil, personal reputations damaged, and financial hardship, all due to biases in an AI system. It is examples like these that highlight the urgency of addressing bias in AI.

But let’s not stop there. This incident isn’t an isolated case of bias wreaking havoc. The impact of bias in AI extends to all corners of our lives. From who gets hired for a job, who gets approved for a loan, to who receives what kind of medical treatment – biased AI systems can perpetuate existing inequalities and create new ones.

Consider this: an AI system trained on biased historical data could deny a well-qualified candidate a job simply because of their gender or ethnicity. Or a biased AI system might deny a loan to a deserving candidate because of their postcode. These are not just hypothetical scenarios; they are happening right now.

Specific types of bias, such as Historical Bias and Measurement Bias, lead to flawed decisions like these. They are inherent in the data, deeply rooted in societal biases, and reflected in the unequal outcomes among different demographic groups. They can skew the decisions of predictive models and result in unfair treatment.

In the grand scheme of things, bias in AI can act as a silent influencer, subtly shaping our society and our lives, often in ways we don’t even realize. All of this might lead you to ask why no action has been taken to stop it, and whether stopping it is even possible.

Indeed, new technological advancements are making it increasingly feasible to tackle this problem. The first step, however, is to understand and acknowledge its existence and impact. For now, its existence has been acknowledged, but the matter of “understanding” it remains quite vague.

Understanding Bias

Whilst the dictionary definition of bias, as given by the Cambridge Dictionary, does not stray far from what the word means in the context of AI, even this single definition leaves room for many interpretations. Taxonomies such as those presented by Hellström et al. (2020) and Kliegr (2021) provide deeper insight into the definition of bias. A glance at these papers reveals, however, that the definition must be narrowed considerably before the problem can be tackled effectively.

It may seem like a detour, but the meaning of bias is best defined and conveyed by defining its opposite: fairness.

Defining Fairness 

Following recent literature such as Castelnovo et al. (2022), fairness can be defined in terms of the concept of potential space. Potential space (PS) refers to the extent of an individual’s capabilities and knowledge, regardless of their belonging to a certain demographic group. Given this definition, fairness is the equality of treatment between two individuals of equal PS, regardless of their observable and hidden differences in bias-inducing parameters (such as race, age, or gender). Any deviation from this definition, also called Equality of Opportunities, is a clear indication of bias and merits further investigation.
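To make this definition more concrete, here is a minimal Python sketch of one common way to check for such a deviation. It rests on an assumption that is ours and not taken from the cited literature: the true outcome label is used as a rough stand-in for potential space, so that "individuals of equal PS" becomes "individuals who truly qualify". The function names and the toy data are hypothetical and purely illustrative.

```python
from collections import defaultdict


def true_positive_rates(y_true, y_pred, groups):
    """Per group: the share of truly qualified individuals (y_true == 1)
    who received a favorable prediction (y_pred == 1)."""
    qualified = defaultdict(int)
    accepted = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            qualified[group] += 1
            if pred == 1:
                accepted[group] += 1
    # Groups without any qualified individual are left out of the result.
    return {g: accepted[g] / qualified[g] for g in qualified}


def equal_opportunity_gap(y_true, y_pred, groups):
    """Largest difference in true positive rate between any two groups.
    A gap of 0.0 means qualified individuals are treated alike."""
    rates = true_positive_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["a"] * 5 + ["b"] * 5

print(true_positive_rates(y_true, y_pred, groups))
print(equal_opportunity_gap(y_true, y_pred, groups))
```

In this toy example, qualified members of group "a" receive a favorable prediction more often than qualified members of group "b", which is the kind of gap that would merit further investigation.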

The practitioners amongst our readers might notice that achieving fairness as defined here may be impossible given the inherent biases existing in our world. That is true! The world we live in, along with all data collected from it, is subject to much historical and statistical bias. This lessens the confidence that we will one day fully eliminate the impact of bias on predictive models trained on such “biased” data. However, various methods can be used to minimize that impact. For this reason, the rest of this blog series speaks of minimizing the impact of bias rather than fully mitigating it.

Okay! We now have an idea of what bias is and how one could potentially evaluate its existence. If we want to tackle the problem properly, however, we need to know where all these biases originate.

Understanding the Sources and Types

Existing research provides valuable insights into the different types of biases in machine learning. Following Mehrabi et al. (2019), biases in machine learning can be divided into three major categories:

  • Data to Algorithm: a category encompassing biases that originate from the data itself, whether caused by poor data collection, inherent biases existing in the world, or other factors.
  • Algorithm to User: a category focusing on biases that stem from the design and functionality of the algorithms. It includes how algorithms might interpret, weigh, or consider certain data points over others, which can lead to biased outcomes.
  • User to Data: a category pertaining to biases that arise from user interaction with the system. The manner in which users input data, their inherent biases, or even their trust in system outputs can influence outcomes.

Figure 1: A visualization of the CRISP-DM framework, which is commonly used in data mining and is relevant for identifying the stages at which bias can arise.

Whilst the names are indicative of the form of bias, one might still wonder which types of biases fall under these umbrella terms. For the enthusiasts among our readers, we’ve provided links to some literature related to this terminology and classification. For the sake of simplicity, this blog post covers a few select biases that are relevant to the situation, almost all of which belong to the Data to Algorithm category. The specific types of biases are as follows:

  • Historical Bias: A type of bias inherent to the data, caused by the natural biases that exist in the world, in different social groups and in society in general. Because this bias is inherent in the world that the data describes, it cannot be mitigated simply through sampling or feature selection.
  • Measurement Bias & Representation Bias: These two closely related biases occur when the different subgroups of the dataset contain unequal amounts of “favorable” outcomes. This type of bias can therefore skew the outcome of predictive models.
  • Algorithmic Bias: Bias purely related to the algorithm in use. As also observed in the tests we ran (elaborated upon later), this type of bias can have a tremendous effect on the fairness of a given algorithm.
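As a rough illustration of the second item above, the following Python sketch compares the share of “favorable” outcomes per subgroup in a dataset. It is a minimal example under stated assumptions, not a full bias audit: the function names, the encoding of a favorable outcome as 1, and the toy data are all hypothetical.

```python
from collections import Counter


def favorable_rates(labels, groups, favorable=1):
    """Share of records with a favorable label within each subgroup."""
    totals = Counter(groups)
    favorable_counts = Counter(
        group for label, group in zip(labels, groups) if label == favorable
    )
    return {g: favorable_counts[g] / totals[g] for g in totals}


def rate_ratio(rates):
    """Lowest favorable rate divided by the highest one.
    1.0 means all subgroups have the same share of favorable outcomes."""
    highest = max(rates.values())
    if highest == 0:
        # No subgroup has any favorable outcome, so the rates are equal.
        return 1.0
    return min(rates.values()) / highest


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

rates = favorable_rates(labels, groups)
print(rates)
print(rate_ratio(rates))
```

A ratio well below 1.0, as in this toy example, suggests that a model trained on the data may learn to favor one subgroup over another.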

This foundational understanding of bias in machine learning will be used to tackle the problem more effectively in later posts.

Final Thoughts

In this exploration of bias within artificial intelligence, we’ve illuminated the profound implications it holds in our increasingly AI-driven world. From real-world examples like the Dutch childcare benefits scandal to the intricate nuances of bias categories and types, it’s evident that recognizing and understanding bias is paramount.

While the challenges posed by biases, whether they be historical, algorithmic, or user-induced, are significant, they are not insurmountable. With a firm grasp on the origins and manifestations of bias, we are better equipped to address them. However, recognition and understanding are just the starting points.

As we move forward in this series, our next focus will be on the tangible tools and frameworks at our disposal. How do we measure the extent of bias in AI models? And more importantly, how do we minimize its impact? These are the pressing questions we’ll delve into next, ensuring that as AI continues to evolve, it does so in a direction that is both fair and performant.
