Safety Guardrails

Description

Safety guardrails apply real-time content moderation and safety constraints during deployment, generating evidence of safety violations and protection effectiveness. This technique produces detailed logs of blocked content, safety incident reports, violation pattern analysis, and guardrail performance metrics. Evidence includes categorised threat taxonomies, false positive/negative rates, and temporal trends in attack patterns, supporting assurance claims about deployed system safety.
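
As a concrete illustration, the sketch below shows one way a guardrail layer could wrap a model call, block violating inputs and outputs, and write the kind of evidence log described above. It is a minimal sketch: safety_classifier, the blocklist, and the log format are illustrative placeholders for a real moderation model and logging pipeline.

```python
import json
import time
from collections import Counter
from typing import Callable, Tuple

def safety_classifier(text: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for a moderation model: returns (is_violation, category)."""
    blocklist = {"toxicity": ["you idiot"], "self_harm": ["hurt myself"]}
    lowered = text.lower()
    for category, phrases in blocklist.items():
        if any(p in lowered for p in phrases):
            return True, category
    return False, "none"

class GuardedModel:
    """Wraps a model call with input/output checks and writes an evidence log."""

    def __init__(self, model: Callable[[str], str], log_path: str = "guardrail_log.jsonl"):
        self.model = model
        self.log_path = log_path
        self.category_counts = Counter()

    def _log_violation(self, stage: str, category: str, text: str) -> None:
        self.category_counts[category] += 1
        record = {"ts": time.time(), "stage": stage, "category": category, "text": text}
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def generate(self, prompt: str) -> str:
        violation, category = safety_classifier(prompt)
        if violation:
            self._log_violation("input", category, prompt)
            return "Sorry, I can't help with that request."
        response = self.model(prompt)
        violation, category = safety_classifier(response)
        if violation:
            self._log_violation("output", category, response)
            return "Sorry, I can't provide that response."
        return response

    def violation_summary(self) -> dict:
        """Blocked-content counts per category, usable as assurance evidence."""
        return dict(self.category_counts)
```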

Example Use Cases

Safety

Implementing real-time filtering on a chatbot to catch any harmful outputs that slip through training-time safety alignment, blocking toxic content before it reaches users.

Implementing guardrails on a medical information chatbot to filter queries requesting diagnosis or treatment recommendations that should only come from licensed professionals, and to block outputs containing specific dosage information without proper context and disclaimers (a minimal rule-based sketch of such checks follows these examples).

Ensuring an educational AI tutor blocks inappropriate content and maintains age-appropriate interactions by filtering both student inputs (detecting potential self-harm signals) and system outputs (preventing exposure to mature content).
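
For the medical chatbot case above, a deployment might pair a trained classifier with explicit rules like the minimal sketch below; the regular expressions, disclaimer string, and function names are illustrative assumptions, not a clinically reviewed filter.

```python
import re

# Illustrative patterns only; a production filter would combine these rules with
# a trained classifier and clinically reviewed policies.
DIAGNOSIS_REQUEST = re.compile(r"\b(do i have|diagnose me|what (illness|disease) do i have)\b", re.IGNORECASE)
DOSAGE_MENTION = re.compile(r"\b\d+(\.\d+)?\s?(mg|mcg|ml|units?)\b", re.IGNORECASE)
REQUIRED_DISCLAIMER = "consult a licensed healthcare professional"

def should_deflect_query(user_query: str) -> bool:
    """Block queries that ask for a diagnosis or treatment recommendation."""
    return bool(DIAGNOSIS_REQUEST.search(user_query))

def should_block_response(response: str) -> bool:
    """Block responses that state a dosage without the required context and disclaimer."""
    mentions_dosage = bool(DOSAGE_MENTION.search(response))
    has_disclaimer = REQUIRED_DISCLAIMER in response.lower()
    return mentions_dosage and not has_disclaimer
```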

Security

Protecting a code generation model from responding to prompts requesting malicious code by filtering both inputs and outputs for security vulnerabilities and attack patterns (see the sketch after these examples).

Protecting a financial advisory AI from generating outputs that could constitute unauthorised securities advice or recommendations violating regulatory requirements, filtering both prompts and responses for compliance violations.
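
A hedged sketch of the code generation use case is shown below, checking both the prompt and the generated code. The intent and risky-construct patterns are illustrative examples rather than a complete defence; a real deployment would rely on a security-focused classifier and static analysis.

```python
import re

# Illustrative signals only; real deployments would use a security-focused
# classifier and static analysis rather than a short pattern list.
MALICIOUS_INTENT = re.compile(r"\b(keylogger|ransomware|ddos|steal credentials)\b", re.IGNORECASE)
RISKY_CONSTRUCTS = re.compile(r"(os\.system\(|shell=True|eval\(input\()")

def screen_codegen_request(prompt: str, generated_code: str) -> tuple:
    """Return (allowed, reason) after checking both the prompt and the generated code."""
    if MALICIOUS_INTENT.search(prompt):
        return False, "prompt requests malicious functionality"
    if RISKY_CONSTRUCTS.search(generated_code):
        return False, "generated code contains a high-risk construct"
    return True, "ok"
```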

Reliability

Ensuring reliable content safety in production by adding a filtering layer that catches edge cases missed during testing and provides immediate protection against newly discovered attack patterns.
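
One way to provide that immediate protection is to keep the filter's rules in a reloadable artefact, so newly discovered attack patterns can be deployed without retraining. The sketch below assumes a simple JSON file of regex strings, which is an illustrative format choice rather than a standard.

```python
import json
import re

class UpdatableFilter:
    """A filter whose pattern list can be reloaded as new attack patterns are found."""

    def __init__(self, rules_path: str = "attack_patterns.json"):
        self.rules_path = rules_path
        self.patterns = []
        self.reload()

    def reload(self) -> None:
        # The rules file is assumed to hold a JSON list of regex strings, e.g.
        # ["ignore previous instructions", "pretend you are .* with no rules"].
        with open(self.rules_path) as f:
            self.patterns = [re.compile(p, re.IGNORECASE) for p in json.load(f)]

    def is_blocked(self, text: str) -> bool:
        return any(p.search(text) for p in self.patterns)
```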

Limitations

  • Adds 50-200ms of latency to inference depending on filter complexity, which may be unacceptable for real-time applications requiring sub-100ms response times (a simple way to measure this overhead is sketched after this list).
  • Safety classifiers may produce false positives that block legitimate content, degrading user experience and system utility.
  • Sophisticated adversaries may craft outputs that evade filtering through obfuscation, encoding, or subtle harmful content.
  • Requires continuous updates to filtering rules and classifiers as new attack patterns and harmful content types emerge.
  • Running guardrail models alongside primary models increases infrastructure costs by 20-50% depending on guardrail complexity, which may be prohibitive for resource-constrained deployments.
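
As referenced in the first limitation, the latency overhead can be measured directly in a given deployment. The sketch below times a placeholder safety check and reports the median added milliseconds per request; the check and prompts are stand-ins for the deployed components.

```python
import statistics
import time

def guardrail_overhead_ms(check, prompts, runs: int = 100) -> float:
    """Median wall-clock time (in ms) spent in the guardrail check per request."""
    samples = []
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            check(prompt)  # placeholder for the deployed safety check
            samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```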

Resources

Research Papers

Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu et al., Aug 28, 2024

Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Shaona Ghosh et al., Apr 9, 2024

As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLM models for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively with the state-of-the-art LLM-based safety models and general purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li et al., Jun 11, 2025

Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation step as an external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and early-stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
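
The partial-detection idea can be illustrated with a short sketch: score the partial output every few tokens and stop generation once a harmfulness threshold is crossed. The scoring function, check interval, and threshold below are placeholders, not the paper's SCM model.

```python
from typing import Callable, Iterable

def stream_with_monitor(
    token_stream: Iterable[str],
    score_harmfulness: Callable[[str], float],  # placeholder for a streaming moderator
    check_every: int = 8,
    threshold: float = 0.9,
) -> str:
    """Emit tokens while periodically scoring the partial output; stop early on harm."""
    output = []
    for i, token in enumerate(token_stream, start=1):
        output.append(token)
        if i % check_every == 0 and score_harmfulness("".join(output)) >= threshold:
            # Policy choice: withhold the partial response once harm is detected.
            return "[response withheld by safety monitor]"
    return "".join(output)
```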

Software Packages

NeMo Guardrails
Apr 18, 2023

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
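
A minimal sketch of calling it from Python is shown below; the config directory layout, model choice, and rail names should be checked against the current NeMo Guardrails documentation, and "./config" is an assumed path.

```python
from nemoguardrails import LLMRails, RailsConfig

# "./config" is assumed to contain a config.yml that defines the main model and
# the input/output rails to apply; see the NeMo Guardrails docs for the format.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "How do I reset my account password?"}
])
print(response["content"])
```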

Documentation

Real-time Serving — Databricks SDK for Python beta documentation
Databricks-sdk-py Developers, Jan 1, 2023
