| | |

How unfair are LLMs really? Evidence from Anthropic’s Discrim-Eval Dataset

Fairness is always an essential criterion for trustworthy and high-quality AI, no matter it’s a credit scoring model, a hiring assistant or a simple chatbot. But what does it mean to have a fair AI? Fairness has several aspects. First, it means all humans should be treated equally. Stereotypes or any other form of prejudice should be prevented. Exclusion or disadvantage in any form as for example sexism, racism or homophobia are avoided. So, coming back to the previous examples, a credit scoring model that rejects black people, a hiring assistant that only hires men and a discriminating chatbot are unfair examples of AI. In addition to the fact that this violates human rights and laws, it can also lead to reputational damage for a company.

How to measure fairness

There are multiple options for analyzing how fair an AI model is. One option is to run tests on the model and evaluate its results. Regarding Large Language Models (LLMs) you can send prompts and analyze the answers. Discrim-Eval1 is a prompt collection provided by Anthropic.  It contains decision questions that cover topics like loan approval, organ transplantation or visa. Each prompt describes a person with three demographic attributes: gender (male, female, non-binary), age (ranging from 20 to 100) and race (white, Black, Asian, Hispanic, Native American). The model is asked to take a binary decision (yes/ no) on the provided decision scenario as in the example below.

All 18900 prompts are designed so that a ‘yes’ response is advantageous to the certain person. This test design makes it possible to test for potential discrimination against demographic groups. By comparing the responses for each of the 70 decision scenarios, discrimination against certain demographic groups can be measured. This indicates the fairness of the tested AI model.

We tested some of the popular LLMs that can be deployed on Azure or AWS Bedrock. These include models from OpenAI, Meta, Mistral, Cohere and Amazon. Some of them were tested on both platforms to see if there is any difference. The accuracy plot below shows the average approval by each model. Since we expect an approval for each scenario, we call this accuracy. The accuracy range is from medium (Llama2 13B Chat and Claude Instant V1) to high (Cohere Command and Mistal 7B Instruct).

Even more interesting from a fairness perspective is how much the subgroups differ from each other. This means that we want to compare the approval rate for different gender, age and race groups. Demographic Parity Ratio (DPR) is a metric that shows how independent a classification is on different demographic groups. Equalized Odds Ratio (EOR) is a similar metric to DPR and shows the independency on demographic groups as well. On top of that it evaluates if the different groups have the same false and true positive rates. In this case however, DPR and EOR have the same values. A value of 1 means an equal performance for all groups. The bar charts below show the EOR for different subgroups in gender, age, and race.

Learnings

  • Most LLMs have high EORs. In one case it is even 1 which means that this model performs absolutely equally for the specific subgroups regarding this dataset. High EOR values are an indicator for a fair and balanced LLM.
  • The accuracy range is much wider than for EOR. Models with a low EOR are more critical and approve more rarely. Another explanation are abstentions. In some cases, the models explain that they can’t take a decision based on the provided information.
  • The performance can be different on Azure and AWS Bedrock. Sometimes the same model performs differently on both platforms. This is either caused by different internal instructions and safeguards or depends on the version. In this paper2 researcher found out that a model performance varies over time.
  • High scores are a good sign but can be elusive. Since this collection is publicly available, it can be used as training data. We don’t have any proof that this is the case, but we expect that at least some of the models were trained on these questions.
  • Anthropic Discrim-Eval is a good start for evaluating fairness in LLMs. It follows a methodology to test if the model is biased by comparing decisions for different groups. That being said, fairness evaluation is challenging and has a lot of facades. Hence, using this dataset alone is not sufficient to evaluate fairness comprehensively. It requires different approaches and techniques to detect hidden bias or discrimination within AI models.

At Validaitor, we use public benchmarks like Discrim-Eval for testing LLMs. This is part of our out-of-the-box tests that we use in order to ensure that AI is fair, unbiased and non-discriminatory. On top of that, we evaluate LLM applications by testing them with use-case specific prompts and our analysis tools.

References

  1. Anthropic Discrim-Eval, https://huggingface.co/datasets/Anthropic/discrim-eval
  2. How Is ChatGPT’s Behavior Changing over Time?, https://arxiv.org/pdf/2307.09009

Similar Posts