Bias In AI: What Are The Causes And Possible Solutions
An artificial intelligence algorithm based on a data-driven approach does not predict the future but encodes the past. The different types of bias present in a data sample arise from incorrect choices during the design phase of the experiment or from the cognitive biases of the human agents who generate or collect the data.
This not only highlights the vital link between an algorithm's behavior and the database on which it is trained but also raises some questions: how can we define the performance of an algorithm? What are biases, and how can they be neutralized?
What Is Distorted Data?
A sample of data is considered biased if the probability that an individual enters the sample depends on the characteristics of the population that is the object of the inference; in other words, if our statistical sample is not representative of the natural phenomenon we want to study through the data. For example, if we tried to estimate the distribution of temperatures in a city but collected data only when the weather was good (instead of every day at the same time), we would end up with a dataset that does not faithfully represent reality.
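A minimal simulation of this weather example, under the invented assumption that good-weather days run about 10 °C warmer, shows how the biased collection scheme skews the estimated mean:

```python
import random

random.seed(0)

# Simulate a year of daily temperatures: good-weather days run warmer
# (hypothetical numbers, for illustration only).
days = []
for _ in range(365):
    good_weather = random.random() < 0.5
    base = 20 if good_weather else 10
    days.append((good_weather, base + random.gauss(0, 3)))

true_mean = sum(t for _, t in days) / len(days)

# Biased sample: temperatures recorded only on good-weather days.
biased = [t for good, t in days if good]
biased_mean = sum(biased) / len(biased)

print(f"true mean:   {true_mean:.1f}")
print(f"biased mean: {biased_mean:.1f}")  # overestimates the true mean
```

Because inclusion in the sample depends on the very feature that drives temperature, no amount of extra good-weather data would fix the estimate.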
Bias: Types And Possible Causes
Different types of bias can be present in a data sample: they usually derive from incorrect choices in the design phase of the experiment or from the cognitive biases of the human agents who generate or collect the data. One of the most frequent examples is response bias, which is especially prevalent in data from the web.
It arises when a tiny slice of the population generates most of the data. Recent studies show, for example, that 50% of posts on Facebook and 50% of reviews on Amazon are produced by less than 10% of users. Such a sample cannot adequately represent the opinions of the entire population of these platforms.
Design Of Experiments
How an experiment is conducted plays a significant role in the quality of the data collected, and many types of bias can result from poor choices in this process. Imagine a scenario in which a new product is launched and reviews are collected for a certain period: what types of bias are likely to appear in the statistical sample?
Suppose we chose to use the first 100 reviews in chronological order: we would likely get reviews only from the loyal customers who bought the product at launch, while neglecting the opinions of many other user groups. This non-randomization in the choice of data falls under selection bias.
Also in this family of biases is the scenario where positive or negative reviews are disproportionate to their share in the actual population. This happens because a dissatisfied customer may be more prone to leave a review than a satisfied one. In this case we speak of participation bias, which in extreme cases can leave us with data samples that are unbalanced with respect to the target variable on which we want to make inferences.
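A toy simulation of participation bias, with invented review probabilities (60% for dissatisfied customers, 10% for satisfied ones), shows how the observed average rating drifts away from the true one:

```python
import random

random.seed(1)

# True customer satisfaction: 80% satisfied (rating 5), 20% dissatisfied (rating 1).
customers = [5] * 800 + [1] * 200
true_avg = sum(customers) / len(customers)  # 4.2

# Participation bias: dissatisfied customers review far more often.
reviews = [r for r in customers
           if random.random() < (0.6 if r == 1 else 0.1)]
observed_avg = sum(reviews) / len(reviews)

print(f"true average rating:     {true_avg:.2f}")
print(f"observed average rating: {observed_avg:.2f}")  # pulled down by participation bias
```

The observed sample is not only smaller but systematically skewed toward the dissatisfied minority.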
Furthermore, in contexts where we can choose from whom to obtain an opinion on the product, there is a risk of running into coverage bias, for example by asking for an opinion only from people who have purchased the product and not from those who have not.
Bias And Use Of Artificial Intelligence
More often than not, artificial intelligence itself is a significant source of bias in the data. An example is the so-called feedback loop, frequent (and often inevitable) for algorithms that interact with users in real time. Consider, for example, a content recommendation algorithm: once it has formulated a content proposal for a user from historical data, it conditions that user's choices through the suggestions it makes, often preventing them from evaluating all the available alternatives.
The data that the algorithm subsequently collects will be conditioned by this, leading to what is, in effect, a loop where the new data it observes is the result of its own behavior. In this context, further types of bias can appear in the data because of the influence the algorithm has on the user: position bias is introduced when a list of contents is presented in an order that emphasizes some items over others, while presentation bias derives from the fact that contents may be presented through different types of media (e.g., video vs. text).
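The feedback loop can be sketched with two hypothetical items of identical quality, where the recommended item accumulates clicks simply because it is the one shown:

```python
import random

random.seed(2)

# Two equally good items; the algorithm always recommends the current leader.
clicks = {"A": 10, "B": 10}

for _ in range(1000):
    shown = max(clicks, key=lambda k: (clicks[k], k))  # ties broken by name
    other = "B" if shown == "A" else "A"
    # Assumed click rates: users click the shown item 30% of the time,
    # the hidden one only 5% — exposure, not quality, drives the gap.
    if random.random() < 0.30:
        clicks[shown] += 1
    elif random.random() < 0.05:
        clicks[other] += 1

print(clicks)  # the early leader dominates even though both items are identical
```

Every new observation is a product of the algorithm's own past choices, which is exactly what makes feedback-loop bias hard to undo after the fact.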
The Transfer From Human Bias
However, data collected by human agents can also carry different types of bias. This happens when assumptions about the data are made on the basis of personal experience that does not apply in more general contexts. Humans may act on prejudices or stereotypes, even unintentionally. For example, a person who has to hire an employee to fulfill the same tasks as their own may tend to prefer candidates with a background similar to theirs.
Problems Induced By Biased Data
A bias in the data today will turn into a bias in the algorithm's output tomorrow. Moreover, much of today's Big Data is itself generated by systems based on AI or machine learning, and may therefore already carry biases induced by the underlying algorithms.
Impact On The Performance Of AI Systems
The first negative side of this condition concerns the inference performance of the algorithm. If I introduce a bias in the selection of the training data (the data on which the algorithm is trained), I may fail to adequately represent the actual population, the one that will generate the test data (the data on which the algorithm will be called upon to make inferences once in production). This misalignment between the data shown to the algorithm during training and the data on which it will then make inferences prevents it from generalizing the phenomenon adequately, hurting its performance.
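A sketch of this misalignment, using a made-up 1-D classification task in which the training data covers only one subpopulation ("group a"), shows the drop in performance on the unseen subpopulation:

```python
import random

random.seed(3)

def sample(group, n):
    """Draw labeled 1-D points. In group "a", class 0 sits near 0 and
    class 1 near 3; in group "b" both classes are shifted up by 2."""
    shift = 0.0 if group == "a" else 2.0
    return [(random.gauss(3 * y + shift, 1), y)
            for y in (random.randint(0, 1) for _ in range(n))]

def fit_threshold(train):
    """Midpoint between the two class means — a minimal classifier."""
    means = []
    for cls in (0, 1):
        xs = [x for x, y in train if y == cls]
        means.append(sum(xs) / len(xs))
    return sum(means) / 2

def accuracy(thr, data):
    return sum((x > thr) == y for x, y in data) / len(data)

train = sample("a", 2000)   # biased: group "b" never appears in training
thr = fit_threshold(train)

print(f"accuracy on group a: {accuracy(thr, sample('a', 2000)):.2f}")
print(f"accuracy on group b: {accuracy(thr, sample('b', 2000)):.2f}")  # much lower
```

The classifier is fine on the population it saw; the shifted population it never saw at training time exposes the selection bias as a plain accuracy loss.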
The second negative side, the most topical one and one of the leading testing grounds of AI, is discriminatory behavior towards specific communities. The great success of machine learning applications has led, in recent years, to a natural proliferation of these algorithms in the most disparate fields.
AI techniques have been introduced in personnel selection, security, and financial resource management. These operations draw on historical data as their primary source: precisely those events recorded in the past by human operators, who may have involuntarily translated their cognitive biases into biases in the data.
Today, with a trend that seems to be growing strongly, an algorithm could find itself deciding whether or not we are a suitable candidate for a specific type of job position or whether it is possible to obtain a loan or mortgage. All these decisions, often with a substantial impact on a person’s life, must be taken in a fair and non-discriminatory manner regarding ethnicity, sex, age, or any other human factor.
It is precisely these reasons that lead us to question what the actual performance of an algorithm is: today, to build and evaluate an AI algorithm, it is no longer sufficient to simply calculate an accuracy metric; it is also necessary to assess its behavior in terms of generalization to data and environments never seen before, while respecting the constraints introduced by the paradigm of fairness in AI.
Potential Solutions To The Problem Of Bias In AI
When a company takes on the challenge of introducing artificial intelligence techniques into its business, it must first ask itself whether its data culture is sufficient to ensure the reliable performance of its algorithms. The first step in designing a data-driven AI system is undoubtedly exploratory data analysis (EDA). This process, if done with due depth and critical sense, can lead to the identification of various types of bias in our data and a greater understanding of the generating phenomenon, an indispensable requirement for preventing and correcting bias both in data and in models.
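As a minimal illustration of such an EDA pass (with entirely invented records and a hypothetical `gender`/`hired` schema), two quick checks can surface both under-representation and skewed label rates:

```python
from collections import Counter

# Hypothetical training set for a hiring model: each record carries a
# sensitive attribute and a label. Values are illustrative only.
records = [
    {"gender": "F", "hired": 0}, {"gender": "M", "hired": 1},
    {"gender": "M", "hired": 1}, {"gender": "M", "hired": 0},
    {"gender": "F", "hired": 0}, {"gender": "M", "hired": 1},
    {"gender": "F", "hired": 1}, {"gender": "M", "hired": 1},
]

# Check 1: group representation in the sample.
counts = Counter(r["gender"] for r in records)
print("representation:", dict(counts))

# Check 2: per-group label rates — large gaps deserve scrutiny.
for g in counts:
    rate = sum(r["hired"] for r in records if r["gender"] == g) / counts[g]
    print(f"hiring rate for {g}: {rate:.2f}")
```

Neither check proves bias on its own, but together they flag exactly the kind of imbalance that a deeper, critical EDA should then explain or correct.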
Practical Tips For Building A “Fair” Database
What are the correct policies to follow when building and monitoring a database? Bias prevention is an ongoing process that begins in data pre-processing and ends in output post-processing. A company undertaking an AI project should adhere to good data use practices, such as:
- Do a priori research on the phenomenon that generates the data, trying to understand the “more general” scenario.
- Separate the data analysis team from the data collection team to minimize cognitive biases during analyses.
- Combine data from different configurations of the same phenomenon to obtain greater generality. Returning to the weather example, this means measuring temperatures under all possible climatic conditions.
- Create a prototype of “ideal data,” i.e., a small sample representative of the phenomenon, to act as a guide during data collection and labeling.
- In the case of manual annotation of data (e.g., classifying whether a review is positive or negative), use multiple verification steps with different human annotators.
- Seek outside help from a domain expert to review the data. An outside eye may notice biases that team members, influencing each other, may have missed.
- Use, when possible, randomization in the generation and selection of data.
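The last point can be sketched on the earlier review example, with an assumed stream where launch-day fans arrive first and rate higher:

```python
import random

random.seed(4)

# Hypothetical review stream: the first 200 reviews come from launch-day
# fans who rate 5; the remaining 800 users rate 3.
reviews = [5] * 200 + [3] * 800

first_100 = reviews[:100]                 # chronological selection
random_100 = random.sample(reviews, 100)  # randomized selection

print(sum(first_100) / 100)   # 5.0 — pure selection bias
print(sum(random_100) / 100)  # close to the population mean of 3.4
```

Randomizing the selection costs nothing here, yet it is the difference between measuring the fans and measuring the population.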
Management Of Biases Produced By AI
When data is generated by AI systems, an excellent strategy to mitigate the bias induced by feedback loops is to randomize the outputs. For example, in a small portion of cases (e.g., 1%), we can make the algorithm generate a random output so that it can “explore” new data. This is common in recommender systems, where it can surface a new user interest that was not observable in the historical data.
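A minimal epsilon-greedy sketch of this strategy (item names, scores, and the epsilon value are invented):

```python
import random

random.seed(5)

def recommend(scores, items, epsilon=0.01):
    """Return the top-scored item, but in a small fraction of calls
    (epsilon) pick uniformly at random to keep exploring new data."""
    if random.random() < epsilon:
        return random.choice(items)
    return max(items, key=lambda i: scores.get(i, 0.0))

scores = {"news": 0.9, "sport": 0.4, "cooking": 0.1}
items = list(scores)

picks = [recommend(scores, items, epsilon=0.05) for _ in range(10000)]
explored = sum(p != "news" for p in picks)
print(f"exploratory picks: {explored} / 10000")
```

Those rare exploratory picks are what let the system observe user reactions to content its own history would never have recommended.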
Management Of Bias Produced By Humans
Regarding biases in human-generated content that translate into biased behavior of the algorithms trained on it, much research has been done in recent years to quantify the level of fairness of an AI system. Recently, the scientific community has introduced many benchmarks and public databases specifically designed to evaluate an algorithm’s fairness. The challenge, in these cases, is to improve fairness without sacrificing statistical performance.
One of the most popular techniques to date is the use of a metric that considers both the algorithm’s accuracy and the degree of independence of its output from “critical” variables (e.g., gender, age, ethnicity). In this paradigm, the algorithm is shown the variables on which it should ideally not discriminate: instead of learning, as usually happens, how the output depends on these variables, the performance metric rewards the independence of the output from them.
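One way such a metric could be assembled, here as a sketch that uses the demographic-parity gap as the independence term and an assumed penalty weight `lam`:

```python
def demographic_parity_gap(preds, groups):
    """Absolute difference in positive-prediction rates between two groups."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    a, b = rates.values()
    return abs(a - b)

def fairness_aware_score(preds, labels, groups, lam=0.5):
    """Accuracy penalized by how strongly the output depends on the
    sensitive variable; lam weighs the fairness term."""
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return acc - lam * demographic_parity_gap(preds, groups)

labels = [1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]

biased_preds = [1, 1, 1, 0, 0, 0]  # output fully determined by the group
fair_preds   = [1, 0, 1, 0, 1, 1]  # equal positive rate in both groups

print(fairness_aware_score(biased_preds, labels, groups))
print(fairness_aware_score(fair_preds, labels, groups))
```

With this score, a model that predicts purely from the sensitive variable is heavily penalized even when its raw accuracy looks acceptable.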
Suppose you face an unbalanced data sample or one affected by participation bias. In that case, one of the most popular approaches relies on generating synthetic data of the under-represented type to compensate for the proportions. However, while this technique improves the learning process of the algorithm, it risks amplifying other types of bias that may be present in the few original samples used to generate the new ones. Managing this trade-off is still an open problem in the scientific literature.
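A minimal sketch of this compensation step, using plain random duplication as a stand-in for richer synthetic-data generators such as SMOTE:

```python
import random

random.seed(6)

def oversample(data, labels):
    """Randomly duplicate minority-class samples until classes balance.
    Note the trade-off: every duplicate inherits whatever bias is in
    the few originals it was drawn from."""
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [random.choice(xs) for _ in range(target - len(xs))]
        out += [(x, y) for x in xs + extra]
    return out

data = [0.1, 0.2, 0.3, 0.9, 5.0]
labels = [0, 0, 0, 0, 1]  # class 1 has a single example

balanced = oversample(data, labels)
print(sum(1 for _, y in balanced if y == 1))  # now 4, all copies of 5.0
```

The balanced sample looks healthier to the learning algorithm, but every minority example is a clone of one original point, which is precisely how bias in the few can be amplified into the many.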
The problem of distorted statistical samples is much more common than one might think, and its origins are often difficult to identify precisely because of their heterogeneity. If neglected, these biases can significantly impact both the performance metrics of your algorithms and their level of fairness. For this reason, good management policies for data and for machine learning and artificial intelligence algorithms are necessary for any company that decides to invest in these technologies.