Jailbreak Resistance Testing
Description
Jailbreak resistance testing evaluates LLM defences against techniques that bypass safety constraints. This involves testing role-playing attacks, hypothetical framing, encoded instructions, and multi-turn manipulation. Testing measures immediate resistance to jailbreaks and the system's ability to detect and block evolving attack patterns, producing reports on vulnerability severity and exploit paths.
Example Use Cases
Safety
Testing a mental health support chatbot to ensure it cannot be jailbroken into providing medical advice that contradicts established clinical guidelines or suggesting harmful interventions, even when users employ emotional manipulation or role-playing scenarios.
Security
Validating that a financial advisory AI cannot be manipulated through multi-turn conversations into revealing proprietary trading algorithms, internal risk assessment models, or client portfolio information through social engineering techniques.
Testing a legal research AI assistant to verify it cannot be jailbroken into generating legally problematic content, revealing confidential case strategies, or providing advice that contradicts professional ethics rules through iterative prompt refinement.
Reliability
Ensuring an educational assessment AI maintains reliable grading standards and cannot be convinced to inflate scores, provide test answers, or bypass academic integrity checks through creative prompt engineering or hypothetical framing.
Limitations
- New jailbreak techniques emerge constantly as adversaries discover creative attack vectors, requiring continuous testing and defense updates.
- Overly restrictive defenses can cause false positive rates of 5-15%, blocking legitimate queries about sensitive topics for educational or research purposes.
- Testing coverage is inherently limited by the creativity of testers, potentially missing novel jailbreak approaches that real users might discover.
- Some jailbreaks work through subtle multi-turn interactions that are difficult to anticipate and test systematically.
- Defence mechanisms such as output filtering, multi-stage validation, and adversarial prompt detection add 100-300ms latency per response, which may impact user experience in real-time applications.
- Defining clear boundaries for what constitutes unacceptable behaviour versus legitimate edge case queries is context-dependent and culturally variable, making universal jailbreak resistance metrics difficult to establish.
Resources
Research Papers
Jailbroken: How Does LLM Safety Training Fail?
Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.