Constitutional AI Evaluation
Description
Constitutional AI evaluation assesses models trained to follow explicit behavioural principles or 'constitutions' that specify desired values and constraints. This technique verifies that models consistently adhere to specified principles across diverse scenarios, measuring both principle compliance and handling of principle conflicts. Evaluation includes testing boundary cases where principles might conflict, measuring robustness to adversarial attempts to violate principles, and assessing whether models can explain their reasoning in terms of constitutional principles.
Example Use Cases
Safety
Evaluating a general-purpose AI assistant with the principle 'provide helpful information while refusing requests for illegal activities' to ensure it consistently distinguishes between legitimate chemistry education queries and attempts to obtain instructions for synthesising controlled substances or explosives.
Evaluating a mental health support chatbot designed with principles like 'be empathetic and supportive while never providing medical diagnoses or replacing professional care' to verify it consistently maintains this boundary across varied emotional crises and user requests.
Transparency
Testing whether an AI assistant can transparently explain its decisions by referencing the specific constitutional principles that guided its responses, enabling users to understand value-based reasoning.
Reliability
Verifying that an AI news aggregator consistently applies stated principles about neutrality versus editorial perspective across diverse political topics and international news sources.
Testing an AI homework assistant built on principles of 'guide learning without providing direct answers' to ensure it maintains this educational philosophy across different subjects, student ages, and question complexities without reverting to simply solving problems.
Limitations
- Constitutional principles may be vague or subject to interpretation, making it difficult to objectively measure compliance.
- Principles can conflict in complex scenarios, and evaluation must assess whether the model's priority ordering matches intended values.
- Models may learn to superficially cite principles in explanations without genuinely using them in decision-making processes.
- Creating comprehensive test sets covering all relevant principle applications and edge cases requires significant domain expertise and resources.
- Constitutional principles that seem clear in one cultural or linguistic context may be interpreted differently across diverse user populations, making it challenging to evaluate whether the model appropriately adapts its principle application or inappropriately varies its values.
- As organisational values evolve or principles are refined based on deployment experience, re-evaluation becomes necessary, but comparing results across principle versions is challenging without confounding changes in the model itself versus changes in evaluation criteria.
Resources
Research Papers
C3AI: Crafting and Evaluating Constitutions for Constitutional AI
Constitutional AI (CAI) guides LLM behavior using constitutions, but identifying which principles are most effective for model alignment remains an open challenge. We introduce the C3AI framework (Crafting Constitutions for CAI models), which serves two key functions: (1) selecting and structuring principles to form effective constitutions before fine-tuning; and (2) evaluating whether fine-tuned CAI models follow these principles in practice. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. In a safety alignment use case, we applied a graph-based principle selection method to refine an existing CAI constitution, improving safety measures while maintaining strong general reasoning capabilities. Interestingly, fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results. This highlights a potential gap between principle design and model adherence. Overall, C3AI provides a structured and scalable approach to both crafting and evaluating CAI constitutions.
Towards Principled AI Alignment: An Evaluation and Augmentation of Inverse Constitutional AI
Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B
As language models continue to grow larger, the cost of acquiring high-quality training data has increased significantly. Collecting human feedback is both expensive and time-consuming, and manual labels can be noisy, leading to an imbalance between helpfulness and harmfulness. Constitutional AI, introduced by Anthropic in December 2022, uses AI to provide feedback to another AI, greatly reducing the need for human labeling. However, the original implementation was designed for a model with around 52 billion parameters, and there is limited information on how well Constitutional AI performs with smaller models, such as LLaMA 3-8B. In this paper, we replicated the Constitutional AI workflow using the smaller LLaMA 3-8B model. Our results show that Constitutional AI can effectively increase the harmlessness of the model, reducing the Attack Success Rate in MT-Bench by 40.8%. However, similar to the original study, increasing harmlessness comes at the cost of helpfulness. The helpfulness metrics, which are an average of the Turn 1 and Turn 2 scores, dropped by 9.8% compared to the baseline. Additionally, we observed clear signs of model collapse in the final DPO-CAI model, indicating that smaller models may struggle with self-improvement due to insufficient output quality, making effective fine-tuning more challenging. Our study suggests that, like reasoning and math ability, self-improvement is an emergent property.