Jailbreak Resistance Testing

Description

Jailbreak resistance testing evaluates LLM defences against techniques that bypass safety constraints. This involves testing role-playing attacks, hypothetical framing, encoded instructions, and multi-turn manipulation. Testing measures immediate resistance to jailbreaks and the system's ability to detect and block evolving attack patterns, producing reports on vulnerability severity and exploit paths.

Example Use Cases

Safety

Testing a mental health support chatbot to ensure it cannot be jailbroken into providing medical advice that contradicts established clinical guidelines or suggesting harmful interventions, even when users employ emotional manipulation or role-playing scenarios.

Security

Validating that a financial advisory AI cannot be manipulated through multi-turn conversations into revealing proprietary trading algorithms, internal risk assessment models, or client portfolio information through social engineering techniques.

Testing a legal research AI assistant to verify it cannot be jailbroken into generating legally problematic content, revealing confidential case strategies, or providing advice that contradicts professional ethics rules through iterative prompt refinement.

Reliability

Ensuring an educational assessment AI maintains reliable grading standards and cannot be convinced to inflate scores, provide test answers, or bypass academic integrity checks through creative prompt engineering or hypothetical framing.

Limitations

New jailbreak techniques emerge constantly as adversaries discover creative attack vectors, requiring continuous testing and defense updates.
Overly restrictive defenses can cause false positive rates of 5-15%, blocking legitimate queries about sensitive topics for educational or research purposes.
Testing coverage is inherently limited by the creativity of testers, potentially missing novel jailbreak approaches that real users might discover.
Some jailbreaks work through subtle multi-turn interactions that are difficult to anticipate and test systematically.
Defence mechanisms such as output filtering, multi-stage validation, and adversarial prompt detection add 100-300ms latency per response, which may impact user experience in real-time applications.
Defining clear boundaries for what constitutes unacceptable behaviour versus legitimate edge case queries is context-dependent and culturally variable, making universal jailbreak resistance metrics difficult to establish.

Resources

Research Papers

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt•Jul 5, 2023

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu et al.•Sep 19, 2023

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Apurv Verma et al.•Jul 20, 2024

Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.

Software Packages

llamator

Sep 5, 2024

Framework for testing vulnerabilities of large language models (LLM).

walledeval

Apr 27, 2024

Test LLMs against jailbreaks and unprecedented harms

Related Techniques

Name	Description	Assurance Goals
Red Teaming	Red teaming is a structured adversarial evaluation process in which a dedicated team systematically probes an AI/ML system to discover vulnerabilities and failure modes. Unlike individual testing techniques (such as prompt injection testing, jailbreak resistance testing, or safety envelope testing), red teaming defines the organisational methodology: team composition, campaign planning, attack taxonomy, and reporting structure. The red team coordinates diverse attack vectors — including prompt manipulation, adversarial examples, edge case exploitation, and social engineering scenarios — into a coherent assessment programme. Results feed into risk registers and remediation plans rather than pass/fail metrics.	General
Prompt Sensitivity Analysis	Prompt Sensitivity Analysis systematically evaluates how variations in input prompts affect large language model outputs, providing insights into model robustness, consistency, and interpretability. This technique involves creating controlled perturbations of prompts whilst maintaining semantic meaning, then measuring how these changes influence model responses. It encompasses various types of prompt modifications including lexical substitutions, syntactic restructuring, formatting changes, and contextual variations. The analysis typically quantifies sensitivity through metrics such as output consistency, semantic similarity, and statistical measures of variance across prompt variations.	Explainability Reliability Safety
Adversarial Training Evaluation	Adversarial training evaluation assesses whether models trained with adversarial examples have genuinely improved robustness rather than merely overfitting to specific attack methods. This technique tests robustness against diverse attack algorithms including those not used during training, measures certified robustness bounds, and evaluates whether adversarial training creates exploitable trade-offs in clean accuracy or introduces new vulnerabilities. Evaluation ensures adversarial training provides genuine security benefits rather than superficial improvements.	Security Reliability Transparency
Prompt Injection Testing	Prompt injection testing systematically evaluates LLMs and generative AI systems for vulnerabilities where malicious users can override system instructions through carefully crafted inputs. This technique encompasses both offensive security testing (proactively attempting injection attacks to identify vulnerabilities) and defensive detection mechanisms (monitoring for suspicious patterns in prompts and context). Testing involves evaluating various injection patterns including direct instruction manipulation, context confusion, role-playing exploits, delimiter attacks, and context window poisoning. Detection methods analyse context provenance, validate formatting consistency, and identify semantic anomalies that indicate manipulation attempts rather than legitimate user interaction.	Security Safety Reliability