Introduction to how to jailbreak an LLM

BySebastian Krauß 16/05/202416/05/2024

A detailed instruction on how to build a bomb, a hateful speech against minorities in the style of Adolf Hitler or an article that explains why Covid was just made up by the government. These examples of threatening, toxic, or fake content can be generated by AI. To eliminate this, some Large Language Model (LLM) providers put in place safety barriers in order to prevent the generation of harmful output.

However, despite these precautions, there exist techniques that can bypass these safeguards, leading to the generation of malicious content. These techniques are commonly known as ‘jailbreaking’. The goal is to exploit a model’s vulnerabilities and break its defense mechanisms. Attackers often use creativity, technical understanding and iterative testing to find vulnerabilities. In this article, I will explore some of the most common jailbreaking attacks.

1. Role plays

One widely-used technique to alter a model’s behavior is role-play. In this scenario, an attacker presents a detailed character description and instructs the AI model to act accordingly. Frequently, these roles may exhibit toxic or sinister characteristics that contradict the model’s guidelines. The Do Anything Now (DAN) model serves as an example in this category, where the model has no restrictions, akin to the EvilBot model that successfully compromised ChatGPT:

2. Different output format

Another method to manipulate an AI model involves requesting a different type of output. Rather than generating text responses, many models can also produce tables or code. Instructing the model to create this kind of content could potentially result in the generation of sexist or racist material, as illustrated in the example below:

3. Persuasion

Various persuasion techniques can also serve as potential attack strategies. Persuasive adversarial prompts (PAP) can exploit AI models using knowledge derived from social science research. Techniques range from fact-based persuasion to biased strategies such as anchoring and threats. For instance, a user might ask for instructions on building a bomb or establishing a fraud scheme under the guise of improving protective measures against such activities:

In addition to the jailbreaking techniques previously discussed, there exist other methods designed to bypass the security measures of an LLM. One such technique involves leaking the model’s instructions – a strategy that significantly amplifies the potential threat of future attacks. Other types of attacks include token smuggling, undertaking translation tasks, and text completions.

It is essential to identify such risks prior to deployment. The potential damage to society and the company can be significant. Therefore, LLMs must be tested against jailbreaking attacks before they are made available to customers. At Validator, we evaluate potential risks of AI services, including jailbreaking, using our specialized tools and expert knowledge.

References

https://github.com/0xk1h0/ChatGPT_DAN

https://huggingface.co/blog/red-teaming

https://chats-lab.github.io/persuasive_jailbreaker

AI Act | Artificial Intelligence | Blog

Who is who in the EU AI Act?

ByMichael Graf 06/05/2024

The EU AI-Act is here! Published on March 13th, 2024, this game-changing legislation is carving out a new path in AI governance. From the key players to the strategic interactions, understanding the complexities of this ecosystem is crucial for understanding your role in it. With this post, we start our series on the AI Act…

AI Act | Artificial Intelligence | Blog

The Nuances of AI Testing: Learnings from AI red-teaming

ByYunus Bulut 09/04/202412/04/2024

Artificial Intelligence (AI) Testing is a complex field that transcends the boundaries of traditional performance testing. While AI developers are well-versed with performance testing due to its prevalence in the educational system, it is crucial to understand that AI encompasses much more than just performance. In this post, I’d like to list some key principles…

Artificial Intelligence | Benchmark | Blog | Fairness

How unfair are LLMs really? Evidence from Anthropic’s Discrim-Eval Dataset

BySebastian Krauß 09/07/202409/07/2024

Fairness is always an essential criterion for trustworthy and high-quality AI, no matter it’s a credit scoring model, a hiring assistant or a simple chatbot. But what does it mean to have a fair AI? Fairness has several aspects. First, it means all humans should be treated equally. Stereotypes or any other form of prejudice…

Artificial Intelligence | Blog

The Hidden Challenges of LLMs: Tackling Hallucinations

ByAman Kumar 05/06/2024

Large Language Models (LLMs) have recently gained acclaim for their ability to generate highly fluent and coherent responses to user prompts. However, alongside their impressive capabilities come notable flaws. One significant vulnerability is the phenomenon of hallucinations, where the models generate incorrect or misleading information. This issue poses substantial risks, especially in sensitive fields such…

AI Act | Artificial Intelligence | Blog | Machine Learning

Model Validation and Monitoring: New phases in the ML lifecycle

ByYunus Bulut 05/08/202109/04/2024

Validation/testing and monitoring of the ML models might be a luxury in the past. But with the enforcement of the regulations on artificial intelligence, they are now indispensable parts of the machine learning pipeline. In the last decade, machine learning (ML) research and practice have gone a long way in establishing a common framework in designing systems and applications…

Artificial Intelligence | Blog | Use Cases | Validaitor

Bias in Legal Trials: The Role of Validaitor in Enhancing Judicial Fairness

ByCem Daloglu 17/06/2024

Legal trials epitomize fairness and justice, where everyone is treated equally before the law. However, conscious and unconscious biases can infiltrate the judicial process, affecting outcomes and undermining public trust. With the advent of technology, particularly large language models, there’s potential to address these biases, but it comes with its challenges. The Presence of Bias…